Abstract

In this paper, a reduced dimensionality representation is learned from multiple views of the processed data. 
These multiple views can be obtained, for example, when the same underlying process is observed using several 
different modalities, or measured with different instrumentation. The goal is to effectively utilize the availability 
of such multiple views for various purposes such as non-linear embedding, manifold learning, spectral clustering, 
anomaly detection and non-linear system identification. The proposed method, which is called multi-view, exploits 
the intrinsic relation within each view as well as the mutual relations between views. This is achieved by defining 
a cross-view model in which an implied random walk process is restrained to hop between objects in the different 
views. This multi-view method is robust to scaling and it is insensitive to small structural changes in the data. 
Within this framework, new diffusion distances are defined to analyze the spectra of the implied kernels. The 
applicability of the multi-view approach is demonstrated for clustering, classification and manifold learning using 
both artificial and real data. 


I. Introduction 

High dimensional big data exist in various fields, and it is difficult to analyze them as is. Extracted
features are useful in analyzing these datasets, but some prior knowledge or modeling is required in order
to identify the essential features. Dimensionality reduction methods, on the other hand, are purely
unsupervised: they aim to find a low dimensional representation that is based on the intrinsic geometry of
the analyzed dataset, which includes the connectivities among multidimensional data points within the
dataset. A “good” dimensionality reduction methodology reduces the complexity of data processing while
preserving the coherency of the original data, such that clustering, classification, manifold learning and
many other data analysis tasks can be applied effectively in the reduced space. Many methods, such as
Principal Component Analysis (PCA) [1], Multidimensional Scaling (MDS) [2], Local Linear Embedding [3],
Laplacian Eigenmaps [4], Diffusion Maps (DM) [5] and more, have been proposed to achieve dimensionality
reduction that preserves data coherency. Exploiting the low dimensional representation yields various
applications such as face recognition that is based on Laplacian Eigenmaps [6], non-linear independent
component analysis with DM [7], musical key extraction using DM [8], and many more. The DM framework
extends and enhances ideas from other methods by utilizing a stochastic Markov matrix that is based on
local affinities between multidimensional data points to identify a lower dimensional representation of
the data. None of the mentioned methods considers the possibility of having more than one view to
represent the same process. An additional view can provide meaningful insight regarding the dynamical
process that has generated and governed the data.

In this paper, we consider learning from data that is analyzed by multiple views. The goal is to effectively
utilize multiple views for tasks such as non-linear embedding, multi-view manifold learning, spectral
clustering, anomaly detection and non-linear system identification, in order to achieve better analysis of
high dimensional big data. Most dimensionality reduction methods suggest concatenating the datasets into a
single vector space. However, this methodology is sensitive to the scaling of each data component. It does
not utilize, for example, the fact that noise in both datasets could be uncorrelated. It also assumes that
both datasets lie in one high dimensional space, which is not always true.

The problem of learning from two views has been studied in the field of spectral clustering. Most of these
studies have focused on classification and clustering that are based on spectral characteristics of the
data while using two or more sampled views. Some approaches that address this problem are the Bilinear
Model [9], Partial Least Squares [10] and Canonical Correlation Analysis [11]. These methods are powerful
for learning the relation between different views, but they do not provide separate or combined insights
into the low dimensional geometry or structure of each view. Recently, a few kernel based methods
(e.g. [12]) proposed a model of co-regularizing kernels in both views in a way that resembles joint
diagonalization. It is done by searching for an orthogonal transformation that maximizes the diagonal
terms of the kernel matrices obtained from all views. A penalty term, which incorporates the disagreement
between clusters from the views, is added. Their algorithm is based on an alternating maximization
procedure. A mixture of Markov chains is proposed in [13] to model multiple views in order to apply
spectral clustering. It deals with two cases in graph theory, directed and undirected graphs, where the
second case is related to our work. This approach converts the undirected graph problem into an averaging
of Markov chains, each of which is constructed separately within a view. A way to incorporate multiple
given metrics for the same data using a cross diffusion process is described in [14]. They define a new
diffusion distance which is useful for classification, clustering or retrieval tasks. However, the
proposed process is not symmetric and thus does not allow computing an embedding. An iterative algorithm
for spectral clustering is proposed in [15]. The idea is to iteratively modify each view using the
representation of the other view. The problem of two manifolds that were derived from the same data
(i.e. two views) is described in [16]. This approach is similar to Canonical Correlation Analysis [11],
which seeks a linear transformation that maximizes the correlation among the views. It demonstrates the
power of this method in canceling uncorrelated noise present in both views. Furthermore, [16] applies its
method to a non-linear system identification task. A similar approach is proposed in [18]. It suggests
data modeling that uses a bipartite graph and then, based on the ‘minimum-disagreement’ algorithm,
partitions the dataset. This approach attempts to minimize the clusters' disagreement between two views.
The study presented in [19] utilizes the agreement, also called consensus, between different views to
extract the geometric information from all views. The framework takes advantage of properties of the
Mahalanobis distance to compute a robust multi-view kernel.

The problem of multi-view dimensionality reduction was also studied using Gaussian Process Latent
Variable Models [20]. The work in [21] uses Gaussian process regression to learn a common hidden
structure. The studies in [22] and [23] demonstrate the capabilities of such models for extracting
meaningful parameters from images. A related work [24] attempts to maximize the mutual information between
the sampled views and the latent variables. Studies such as [25], [26], [27] use a probabilistic CCA to
factorize the data into common and view-specific information.

In this work, we present a framework based on the construction in [18] and show that this approach
is a special case of a more general diffusion based process. We build and analyze a new framework
that generalizes the random walk model while utilizing multiple views. Our proposed method utilizes
the intrinsic relation within each view as well as the mutual relations between views. The multi-view
approach is achieved by defining a cross diffusion process in which a specially structured random walk is
imposed between the various views. Within this framework, new diffusion distances are defined to analyze
the spectra of the new kernels, and we compute the infinitesimal generator to which our multi-view kernel
converges. The constructed multi-view kernel matrix is similar to a symmetric matrix, which guarantees real
eigenvalues and eigenvectors; this property enables us to define a multi-view embedding. The advantages
of the proposed method for manifold learning and spectral clustering are explored using extensive
experiments.

The paper has the following structure: Background is given in Section II. Section III presents and
analyzes the multi-view framework. Section IV-F studies the asymptotic properties of the proposed kernel
K (Eq. (5)). Section VI presents the experimental results.


II. Background

A. General dimensionality reduction framework 

Consider a high dimensional dataset $X = \{x_1, x_2, x_3, \ldots, x_M\} \in \mathbb{R}^{M \times N}$,
$x_i \in \mathbb{R}^N$, $i = 1, \ldots, M$. The goal is to find a low dimensional representation
$Z = \{z_1, z_2, z_3, \ldots, z_M\} \in \mathbb{R}^{M \times S}$, $z_i \in \mathbb{R}^S$, $i = 1, \ldots, M$,
such that $S \ll N$ and the local connectivities among the multidimensional data points are preserved.
This problem setup is based on the assumption that the data is represented (viewed) by a single vector
space (single view).


B. Diffusion Maps (DM) 

DM [5] is a dimensionality reduction method that finds the intrinsic geometry in the data. This
framework is highly effective when the data is densely sampled from some low dimensional manifold
that is not linear. Given a high dimensional dataset $X$, the DM framework contains the following steps:

1) A kernel function $\mathcal{K} : X \times X \to \mathbb{R}$ is chosen. It is represented by a matrix
$K \in \mathbb{R}^{M \times M}$ which satisfies for all $(x_i, x_j) \in X$ the following properties:
symmetry, $K_{i,j} = \mathcal{K}(x_i, x_j) = \mathcal{K}(x_j, x_i)$; positive semi-definiteness,
$v_i^T K v_i \ge 0$ for all $v_i \in \mathbb{R}^M$; and non-negativity, $\mathcal{K}(x_i, x_j) \ge 0$.
These properties guarantee that the matrix $K$ has real eigenvectors and non-negative real eigenvalues.
The Gaussian kernel is a common example, where $K_{i,j} = \exp\{-\frac{\|x_i - x_j\|^2}{2\sigma^2}\}$
with an $L_2$ norm as the affinity measure between two data vectors;

2) By normalizing the kernel using $D$, where $D_{i,i} = \sum_j K_{i,j}$, we compute the following matrix
elements:

$$P^x_{i,j} = \mathcal{P}(x_i, x_j) = [D^{-1} K]_{i,j}. \qquad (1)$$

The resulting matrix $P^x \in \mathbb{R}^{M \times M}$ can be viewed as the transition kernel of a
(fictitious) Markov chain on $X$ such that the expression $[(P^x)^t]_{i,j} = p_t(x_i, x_j)$ describes the
transition probability from point $x_i$ to point $x_j$ in $t$ steps.

3) Spectral decomposition is applied to the matrix $P^x$ or to one of its powers $(P^x)^t$ to obtain a
sequence of eigenvalues $\{\lambda_m\}$ and normalized eigenvectors $\{\psi_m\}$ that satisfy
$P^x \psi_m = \lambda_m \psi_m$, $m = 0, \ldots, M - 1$;

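4) Define a new representation for the dataset $X$

$$\Psi_t(x_i) : x_i \mapsto [\lambda_1^t \psi_1[i], \lambda_2^t \psi_2[i], \lambda_3^t \psi_3[i], \ldots, \lambda_{M-1}^t \psi_{M-1}[i]]^T \in \mathbb{R}^{M-1}, \qquad (2)$$

where $t$ is the selected number of steps and $\psi_m[i]$ denotes the $i$-th element of $\psi_m$.
The main idea behind this representation is that the Euclidean distance between two data points in
the new representation is equal to the weighted $L_2$ distance between the conditional probabilities
$p_t(x_i, :)$ and $p_t(x_j, :)$, $i, j = 1, \ldots, M$ (the $i$-th and $j$-th rows of $(P^x)^t$). The
following is referred to as the Diffusion Distance

$$\mathcal{D}_t^2(x_i, x_j) = \|\Psi_t(x_i) - \Psi_t(x_j)\|^2 = \sum_{m \ge 1} \lambda_m^{2t} (\psi_m[i] - \psi_m[j])^2 = \|p_t(x_i, :) - p_t(x_j, :)\|^2_{W^{-1}}, \qquad (3)$$

where $W$ is a diagonal matrix with elements $W_{i,i} = D_{i,i} / \sum_{i=1}^{M} D_{i,i}$. This equality
is proven in [5].

5) The desired accuracy $\delta \ge 0$ is chosen for the diffusion distance defined by Eq. (3) such that
$s(\delta, t) = \max\{\ell \in \mathbb{N} \text{ such that } |\lambda_\ell|^t > \delta |\lambda_1|^t\}$.
By using $\delta$, a new mapping of $s(\delta, t)$ dimensions is defined as

$$\Psi_t^{(\delta)} : X \to [\lambda_1^t \psi_1[i], \lambda_2^t \psi_2[i], \lambda_3^t \psi_3[i], \ldots, \lambda_s^t \psi_s[i]]^T \in \mathbb{R}^{s(\delta, t)}.$$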
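
The five steps above can be sketched numerically as follows. This is a minimal illustration rather than a reference implementation: the function name `diffusion_map` and its arguments are assumptions introduced here, and the eigenvectors are rescaled so that the trivial eigenvector is constant with unit magnitude, which is the normalization under which the identity in Eq. (3) holds with $W_{i,i} = D_{i,i}/\sum_i D_{i,i}$.

```python
import numpy as np

def diffusion_map(X, sigma, t=1):
    """Single-view diffusion map (hypothetical helper, integer t assumed)."""
    # Step 1: Gaussian kernel from pairwise squared Euclidean distances.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    # Step 2: row-normalise to obtain the Markov matrix P = D^{-1} K.
    d = K.sum(axis=1)
    P = K / d[:, None]
    # Step 3: eigendecompose the symmetric conjugate D^{-1/2} K D^{-1/2},
    # which shares its eigenvalues with P.
    Ps = K / np.sqrt(np.outer(d, d))
    lam, Pi = np.linalg.eigh(Ps)
    order = np.argsort(-np.abs(lam))
    lam, Pi = lam[order], Pi[:, order]
    # Right eigenvectors of P, rescaled so that psi_0 is constant with |psi_0| = 1.
    Psi = Pi / np.sqrt(d)[:, None] * np.sqrt(d.sum())
    # Step 4: drop the trivial eigenvector and weight the rest by lambda^t.
    emb = Psi[:, 1:] * lam[1:] ** t
    return P, lam, emb
```

Truncating the columns of `emb` to the first $s(\delta, t)$ coordinates would give the reduced mapping of step 5.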

This approach has been found useful in various fields. As previously noted, it is limited to a single view
representation. A common extension of this approach to multiple views is to concatenate the data from all
views and then apply the diffusion framework. This method assumes orthogonality of the sampled dimensions,
which is an unrealistic assumption in many cases. Furthermore, this approach can create redundancy in some
dimensions and requires scaling each dimension separately so that none is favored over the others.
Previous studies such as [28], [29] apply the DM framework to each view individually and then combine the
mappings learned from the various views. However, they do not exploit the mutual relations that might
exist between the different views to create and utilize the correct mapping.

III. Multi-view dimensionality reduction 

Problem Formulation: Given multiple sets of observations $X^l$, $l = 1, \ldots, L$. Each view is a high
dimensional dataset $X^l = \{x^l_1, x^l_2, x^l_3, \ldots, x^l_M\} \in \mathbb{R}^{M \times N_l}$, where
$N_l$ is the dimension of the $l$-th feature space. Note that a bijective correspondence between views is
assumed. For each view $l = 1, \ldots, L$, we seek a lower dimensional representation that preserves the
interactions between multidimensional data points within a given view $X^l$ and among the views
$\{X^1, X^2, \ldots, X^L\}$.


A. Multi-view Diffusion Maps 

We begin by generalizing the DM framework to handle a multi-view scenario. Our goal is to
impose a random walk model using the local connectivities between data points within all views. Our
way to generalize the DM framework is by restricting the random walker to “hop” between views in
each step. The construction requires choosing a symmetric positive semi-definite kernel for each view,
$\mathcal{K}^l : X^l \times X^l \to \mathbb{R}$, $l = 1, \ldots, L$. We use the Gaussian function

$$K^l_{i,j} = \exp\left\{-\frac{\|x^l_i - x^l_j\|^2}{2\sigma_l^2}\right\}, \qquad (4)$$

then the multi-view kernel is formed by the following matrix

$$K = \begin{bmatrix}
0_{M \times M} & K^1 K^2 & K^1 K^3 & \cdots & K^1 K^L \\
K^2 K^1 & 0_{M \times M} & K^2 K^3 & \cdots & K^2 K^L \\
K^3 K^1 & K^3 K^2 & 0_{M \times M} & \cdots & K^3 K^L \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
K^L K^1 & K^L K^2 & K^L K^3 & \cdots & 0_{M \times M}
\end{bmatrix}. \qquad (5)$$

Finally, by using the diagonal matrix $D$ where $D_{i,i} = \sum_j K_{i,j}$, the normalized row-stochastic
matrix is defined as

$$P = D^{-1} K, \quad P_{i,j} = \frac{K_{i,j}}{D_{i,i}}, \qquad (6)$$

where the $m,l$ block is a square $M \times M$ matrix located at
$[1 + (m - 1)M, 1 + (l - 1)M]$, $m, l = 1, \ldots, L$. This block describes the probability of transition
between view $X^m$ and $X^l$.
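
The construction of Eqs. (4)-(6) can be sketched as below. The helper names `gaussian_kernel` and `multiview_transition` are assumptions made for illustration only; the sketch builds the dense $LM \times LM$ matrices directly and is intended for small examples.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gaussian kernel of Eq. (4) for one view (rows of X are data points)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def multiview_transition(views, sigmas):
    """Block kernel of Eq. (5) and row-stochastic matrix of Eq. (6)."""
    L, M = len(views), views[0].shape[0]
    Ks = [gaussian_kernel(X, s) for X, s in zip(views, sigmas)]
    K = np.zeros((L * M, L * M))
    for l in range(L):
        for m in range(L):
            if l != m:  # diagonal blocks stay zero: the walker must change view
                K[l * M:(l + 1) * M, m * M:(m + 1) * M] = Ks[l] @ Ks[m]
    D = K.sum(axis=1)
    P = K / D[:, None]
    return K, P
```

Because block $(l, m)$ is the transpose of block $(m, l)$, the matrix `K` is symmetric even though the individual products $K^l K^m$ are not.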


B. Alternative multi-view approaches 

In this section, we describe two additional methods for incorporating different views. We do not analyze 
these approaches but use them as references for comparisons in the experimental evaluations. 

1. Kernel Product DM (KP): The kernel matrices are multiplied element-wise,
$K^\circ = K^1 \circ K^2 \circ \cdots \circ K^L$, $K^\circ_{i,j} = K^1_{i,j} \cdot K^2_{i,j} \cdots K^L_{i,j}$.
Then, they are normalized by the sum of rows. The resulting row stochastic matrix is

$$P^\circ_{i,j} = \frac{K^\circ_{i,j}}{D^\circ_{i,i}}, \quad \text{where } D^\circ_{i,i} = \sum_j K^\circ_{i,j}. \qquad (7)$$

Lemma 1. In the special case of a Gaussian kernel with $\sigma_1 = \sigma_2 = \ldots = \sigma_L$ in Eq. (4),
the resulting matrix $K^\circ$ is equal to the matrix $K^w$ constructed using the concatenated vector
$w_i = [(x^1_i)^T, (x^2_i)^T, \ldots, (x^L_i)^T]^T$ such that
$K^w_{i,j} = \exp\{-\frac{\|w_i - w_j\|^2}{2\sigma_w^2}\}$. The scale is set to

$$\sigma_w = \sigma_l, \quad l = 1, \ldots, L, \qquad (8)$$

where the equality holds only for this special case.

This approach, which corresponds to [5], is referred to as the Kernel Product DM in Section VI-A.


2. Kernel Sum DM (KS): The sum kernel is defined as $K^+ = \sum_{l=1}^{L} K^l$. By normalizing the sum
kernel by the diagonal matrix $D^+$, where $D^+_{i,i} = \sum_j K^+_{i,j}$, we get

$$P^+_{i,j} = [(D^+)^{-1} K^+]_{i,j}. \qquad (9)$$

This random walk sums the step probabilities from each view. This approach is proposed in [13].
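
Both reference constructions are short enough to sketch directly. The function names below are hypothetical; the sketch also makes it easy to check numerically that, with a common scale, the element-wise product of Gaussian kernels coincides with a Gaussian kernel on the concatenated vectors, as stated in Lemma 1.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def row_normalize(K):
    """Divide each row by its sum to obtain a row-stochastic matrix."""
    return K / K.sum(axis=1, keepdims=True)

def product_kernel(views, sigmas):
    """Element-wise (Hadamard) product of the single-view kernels (KP)."""
    M = views[0].shape[0]
    K = np.ones((M, M))
    for X, s in zip(views, sigmas):
        K = K * gaussian_kernel(X, s)
    return K

def sum_kernel(views, sigmas):
    """Sum of the single-view kernels (KS)."""
    return sum(gaussian_kernel(X, s) for X, s in zip(views, sigmas))
```

The transition matrices of Eqs. (7) and (9) are then `row_normalize(product_kernel(...))` and `row_normalize(sum_kernel(...))`, respectively.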


C. Probabilistic interpretation of P 


In our proposed construction (Eqs. (5) and (6)), the entries $[P^t]_{i,j} = p_t(x^1_i, x^1_j)$ denote, for
each $i, j = 1, \ldots, M$, the transition probability from node $x^1_i$ to node $x^1_j$ in $t$ time steps
by “hopping” between the views $X^1$ and $X^l$, $l = 2, \ldots, L$, in each time step, where $L$ is the
total number of views. Note that due to the zero diagonal blocks of $K$ and $P$ (Eq. (6)), which form a
block-anti-diagonal structure for $L = 2$, this probability is zero for $t = 1$. However, for higher
values of $t$, this probability is nonzero, describing a transition from view $X^1$ through any view
$X^l$, $l = 2, \ldots, L$, and back to $X^1$. In the same way,
$[P^t]_{i+(l-1)M,\, j+(l-1)M} = p_t(x^l_i, x^l_j)$ denotes the transition probability from node $x^l_i$ to
node $x^l_j$, $i, j = 1, \ldots, M$, in $t$ time steps. Likewise,
$[P^t]_{i+(l-1)M,\, j+(m-1)M} = p_t(x^l_i, x^m_j)$ denotes the transition probability from node $x^l_i$ to
node $x^m_j$, $i, j = 1, \ldots, M$, $l \neq m$, in $t$ time steps.

1) Smoothing effect t = 1: For simplicity, let us examine the term $p_t(x^l_i, x^m_j)$, $l \neq m$, for
$t = 1$. The transition probability for $t = 1$ is

$$p_1(x^l_i, x^m_j) = \frac{\sum_s K^l_{i,s} K^m_{s,j}}{D_{\tilde{l}+i,\, \tilde{l}+i}},$$

where $\tilde{l} = (l - 1) \cdot M$. This probability takes into consideration all the various
connectivities of node $x^l_i$ to node $x^l_s$ and the connectivities of the corresponding node $x^m_s$ to
the destination node $x^m_j$. The proposed multi-view approach has a smoothing effect in terms of the
transition probability. By smoothing effect, we mean that the probability of transitioning from $x^l_i$ to
$x^m_j$ could be larger than zero even if $K^l_{i,j} = 0$ and $K^m_{i,j} = 0$. Assume that there is a
subset $S = \{s_1, \ldots, s_F\}$ such that $K^l_{i,s_f} > 0$ and $K^m_{s_f,j} > 0$, $f = 1, \ldots, F$;
then, by the definition of the multi-view probability, we get that $p_1(x^l_i, x^m_j) > 0$.

Figure 1 illustrates the multi-view transition probabilities compared to a single view approach using
two deformed Swiss-roll manifolds (L = 2). In each view, there is no probability of transition from one
side of the gap to the other. The multi-view transition probability is non-zero for points at both sides of
the gap. This smoothing effect occurs because the gap is located near different points in the two views,
thus allowing the multi-view kernel to smooth the nonlinear gap.

Fig. 1: Top left: Non-smooth Swiss Roll sampled from View-I ($X$), colored by the single view probability
of transition (t = 1) from $x_1$ to $x_:$. Top right: second Swiss Roll sampled from View-II ($Y$), colored
by the single view probability of transition (t = 1) from $y_1$ to $y_:$. Bottom left: the first Swiss Roll
colored by the multi-view probabilities of transition (t = 1) from $x_1$ to $y_:$. The point $x_1$ is
denoted with an arrow in the top left figure. Bottom right: a low dimensional representation extracted
based on the multi-view transition matrix $P$ (Eq. (6)).
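
The smoothing effect at $t = 1$ can be seen on a toy example with three corresponding points and $L = 2$. The 0/1 matrices below are hypothetical compactly supported kernels chosen for illustration (a Gaussian kernel is never exactly zero): points 0 and 2 are disconnected in both views, yet the one-step cross-view transition between them is positive because point 1 links them.

```python
import numpy as np

# Hypothetical thresholded kernels: point 0 ~ point 1 in view l only,
# point 1 ~ point 2 in view m only.
K_l = np.array([[1, 1, 0],
                [1, 1, 0],
                [0, 0, 1]], dtype=float)
K_m = np.array([[1, 0, 0],
                [0, 1, 1],
                [0, 1, 1]], dtype=float)

def cross_view_step(K_l, K_m):
    """One-step transition probabilities from view l to view m for L = 2."""
    K_lm = K_l @ K_m
    return K_lm / K_lm.sum(axis=1, keepdims=True)

p1 = cross_view_step(K_l, K_m)
# K_l[0, 2] and K_m[0, 2] are both zero, but p1[0, 2] equals 1/3.
```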


2) Increasing the diffusion step t: Under the stochastic Markov model assumption, increasing the power
of the matrix $P$ spreads the probability along the data points based on the connectivities in all views.
This probability spread, as described in [5], reduces the influence of high eigenvectors on the diffusion
distance (Eq. (10)). This implies that the eigenvectors corresponding to low eigenvalues have a
low-frequency content, whereas the eigenvectors corresponding to the high eigenvalues describe the
oscillatory behavior of the data [5]. In Fig. 2, we present the eigenvalues of the matrix $P$ at different
values of $t$. For the experiment, we generated $L = 3$ Swiss rolls with $M = 1200$ data points each. It is
evident that the numerical rank of $P^t$ decreases for higher values of $t$.



Fig. 2: The decay of the eigenvalues for increasing powers of the matrix P.


D. Multi-view diffusion distance

In a variety of real data types, the Euclidean distance does not provide sufficient information about the
intrinsic relations between data points. The Euclidean distance is highly sensitive to scaling and
rotations of multidimensional data points. Tasks such as classification, clustering or system
identification require a measure of the intrinsic connectivity between data points. This type of measure
is only satisfied locally by the Euclidean distance in the high dimensional ambient space. The multi-view
diffusion kernel (defined in Section III-A) describes all the small local connections between data points.
The row stochastic matrix $P^t$ (Eq. (6)) incorporates all the possibilities for having a transition in
$t$ time steps between data points while hopping between both views. For a fixed value $t > 0$, two data
points are intrinsically similar if the conditional distributions $p_t(x_i, :) = [P^t]_{i,:}$ and
$p_t(x_j, :) = [P^t]_{j,:}$ are similar. This type of similarity measure indicates that the points $x_i$
and $x_j$ are similarly connected to several mutual points. Thus, they are connected by a geometrical
path. In many cases, a small Euclidean distance can be misleading due to the fact that two data points can
be “close” without having any geodesic path that connects them. Comparing the transition probabilities is
more robust, as it takes into consideration all of the local connectivities between the compared points.
Therefore, even if two points do not have a small Euclidean distance between them, they may have many
common neighbors and thus have a low diffusion distance.

Based on this observation, by expanding the single view construction given in [5], we define the
weighted inner view diffusion distances for the first view as




$$\mathcal{D}_t^2(x^1_i, x^1_j) = \sum_{k=1}^{L \cdot M} \frac{([P^t]_{i,k} - [P^t]_{j,k})^2}{\tilde{\phi}_0(k)} = \|(e_i - e_j)^T P^t\|^2_{D^{-1}}, \qquad (10)$$

where $1 \le i, j \le M$, $e_i$ is the $i$-th column of an $L \cdot M \times L \cdot M$ identity matrix,
$\tilde{\phi}_0$ is the first left eigenvector of $P$ and its $k$-th element is
$\tilde{\phi}_0(k) = D_{k,k}$. The weighted norm of $x$ is defined by $\|x\|^2_W = x^T W x$.
Similarly, the weighted distance is defined for the $l$-th view

$$\mathcal{D}_t^2(x^l_i, x^l_j) = \sum_{k=1}^{L \cdot M} \frac{([P^t]_{\tilde{l}+i,k} - [P^t]_{\tilde{l}+j,k})^2}{\tilde{\phi}_0(k)} = \|(e_{\tilde{l}+i} - e_{\tilde{l}+j})^T P^t\|^2_{D^{-1}}, \qquad (11)$$

where $\tilde{l} = (l - 1) \cdot M$. The main advantage of these distances (Eqs. (10) and (11)) is that
they can be expressed in terms of the eigenvalues and the eigenvectors of the matrix $P$. This insight
allows us to use a representation (defined in Section III-E) where the induced Euclidean distance is
proportional to the diffusion distances defined in Eqs. (10) and (11).

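Theorem 1. The inner view diffusion distance defined by Eqs. (10) and (11) is equal to

$$\mathcal{D}_t^2(x^l_i, x^l_j) = \sum_{k=1}^{L \cdot M - 1} \lambda_k^{2t} (\psi_k[i + \tilde{l}] - \psi_k[j + \tilde{l}])^2, \qquad (12)$$

where $\tilde{l} = (l - 1) \cdot M$.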

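Proof. Let $P_s = D^{-1/2} K D^{-1/2} = \Pi \Lambda \Pi^T$ be the symmetric matrix that is similar to $P$,
and let $\Psi = D^{-1/2} \Pi$ and $\Phi = D^{1/2} \Pi$ hold the right and left eigenvectors of
$P = \Psi \Lambda \Phi^T$. We express $P^t D^{-1} (P^t)^T$ by
$P^t D^{-1} (P^t)^T = \Psi \Lambda^t \Phi^T D^{-1} \Phi \Lambda^t \Psi^T = \Psi \Lambda^{2t} \Psi^T$, since
$\Phi^T D^{-1} \Phi = \Pi^T \Pi = I$. Therefore,

$$\mathcal{D}_t^2(x^l_i, x^l_j) = \|(e_{\tilde{l}+i} - e_{\tilde{l}+j})^T P^t\|^2_{D^{-1}} = (e_{\tilde{l}+i} - e_{\tilde{l}+j})^T P^t D^{-1} (P^t)^T (e_{\tilde{l}+i} - e_{\tilde{l}+j})$$

$$= (e_{\tilde{l}+i} - e_{\tilde{l}+j})^T \Psi \Lambda^{2t} \Psi^T (e_{\tilde{l}+i} - e_{\tilde{l}+j}) = \sum_{k=1}^{L \cdot M - 1} \lambda_k^{2t} (\psi_k[i + \tilde{l}] - \psi_k[j + \tilde{l}])^2.$$

The term $k = 0$ is excluded since $\psi_0 = \mathbf{1}$ (an all-ones vector), which holds for all
stochastic matrices. □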
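
Theorem 1 can be checked numerically on a small random example. The sketch below is illustrative only: the helper names are assumptions, a single scale `sigma` is used for all views, and `a`, `b` are the zero-based row indices $\tilde{l} + i$ and $\tilde{l} + j$ of two points from the same view.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def multiview_matrices(views, sigma):
    """Multi-view kernel K of Eq. (5) and its row sums d (diagonal of D)."""
    L, M = len(views), views[0].shape[0]
    Ks = [gaussian_kernel(X, sigma) for X in views]
    K = np.zeros((L * M, L * M))
    for l in range(L):
        for m in range(L):
            if l != m:
                K[l * M:(l + 1) * M, m * M:(m + 1) * M] = Ks[l] @ Ks[m]
    return K, K.sum(axis=1)

def distance_direct(K, d, a, b, t):
    """Eq. (11): D^{-1}-weighted distance between rows a and b of P^t."""
    Pt = np.linalg.matrix_power(K / d[:, None], t)
    return np.sum((Pt[a] - Pt[b]) ** 2 / d)

def distance_spectral(K, d, a, b, t):
    """Eq. (12): the same distance computed from the eigenpairs of P."""
    lam, Pi = np.linalg.eigh(K / np.sqrt(np.outer(d, d)))
    Psi = Pi / np.sqrt(d)[:, None]          # right eigenvectors of P
    lam, Psi = lam[:-1], Psi[:, :-1]        # drop the trivial pair (lambda = 1)
    return np.sum(lam ** (2 * t) * (Psi[a] - Psi[b]) ** 2)
```

The two functions agree up to floating point error, while the spectral form only needs one eigendecomposition for all pairs of points and all values of $t$.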













E. Multi-view data parametrization 

Tasks such as classification, clustering or regression in a high-dimensional feature space are considered
to be computationally expensive. In addition, the performance of these tasks is highly dependent on the
distance measure. As explained in Section III-D, distance measures in the original ambient space are
meaningless in many real life situations. Interpreting Theorem 1 in terms of Euclidean distance enables
us to define mappings for every view $X^l$, $l = 1, \ldots, L$, using the right eigenvectors of $P$
(Eq. (6)) weighted by $\lambda_i^t$. The representation for instances in $X^l$ is given by

$$\Psi_t(x^l_i) : x^l_i \mapsto [\lambda_1^t \psi_1[i + \tilde{l}], \ldots, \lambda_{M-1}^t \psi_{M-1}[i + \tilde{l}]]^T \in \mathbb{R}^{M-1}, \qquad (13)$$

where $\tilde{l} = (l - 1) \cdot M$. These $L$ mappings capture the intrinsic geometry of the views as well
as the mutual relation between them. As shown in [30], the set of eigenvalues $\lambda_m$ has a decaying
property such that $1 = |\lambda_0| \ge |\lambda_1| \ge \ldots \ge |\lambda_{M-1}|$. Exploiting the
decaying property enables us to represent data up to a dimension $r$ where $r \ll N_1, \ldots, N_L$. The
dimension $r = r(\delta)$ is determined by approximating the diffusion distance (Eq. (12)) up to a desired
accuracy $\delta$. This argument is expanded in Section IV-D. The reduced dimension version of
$\Psi_t(X)$ is denoted by $\Psi^r_t(X)$.

Using the inner view diffusion distances defined in Eqs. (10) and (11), we define a multi-view diffusion
distance as a linear combination of the inner view distances such that

$$\left(\mathcal{D}_t^{(MV)}(i, j)\right)^2 = \sum_{l=1}^{L} \mathcal{D}_t^2(x^l_i, x^l_j). \qquad (14)$$

This distance is the induced Euclidean distance in a space constructed from the concatenation of all low
dimensional multi-view mappings

$$\widehat{\Psi}_t(X) = [\Psi^r_t(X^1), \Psi^r_t(X^2), \ldots, \Psi^r_t(X^L)]. \qquad (15)$$

This mapping is used in Section VI-B for the experimental evaluation of clustering.


F. Multi-view kernel bandwidth 

When constructing the Gaussian kernels $K^l$, $l = 1, \ldots, L$, in Eq. (4), the values of the scale
(width) parameters $\sigma_l$ have to be set. Setting these values to be too small may result in very small
local neighborhoods that are unable to capture the local structure around the data points. On the other
hand, setting the values to be too large may result in a fully connected graph that may generate a coarse
description of the data. In [31], a max-min measure is suggested such that the scale becomes

$$\sigma_l^2 = C \cdot \max_j \left[\min_{i, i \neq j} (\|x^l_i - x^l_j\|^2)\right], \qquad (16)$$

where $C$ is set within the range [1, 1.5]. This approach attempts to set a small scale to maintain local
connectivities. This single view approach could be relaxed in the multi-view scenario. The multi-view
kernel $K$ (Eq. (5)) contains multiplications of the single view kernel matrices $K^l$,
$l = 1, \ldots, L$ (Eq. (4)). The diagonal values of each kernel matrix $K^l$, $l = 1, \ldots, L$, are all
1's; therefore, a connectivity in only one view is sufficient. This insight suggests that a smaller value
for the parameter $C$ could be used.
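
The max-min rule of Eq. (16) amounts to taking the largest nearest-neighbor squared distance in a view and scaling it by $C$. A minimal sketch, with the hypothetical helper name `max_min_scale`, could be:

```python
import numpy as np

def max_min_scale(X, C=1.0):
    """Return sigma^2 = C * max_j min_{i != j} ||x_i - x_j||^2 for one view."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq, np.inf)  # exclude i == j from the inner minimum
    return C * np.max(np.min(sq, axis=0))
```

For the one dimensional points 0, 1 and 3, the nearest-neighbor squared distances are 1, 1 and 4, so the rule returns 4 for $C = 1$; every point then has at least one neighbor within one scale length.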

Another scheme by [32] aims to find a range of values for $\sigma_l$. The idea is to compute the kernel
$K^l$ (Eq. (4)) for various values of $\sigma$ and search for the range of values where the Gaussian bell
shape exists. The range is identified by applying a logarithmic function to the sum of the kernel. We
expand this idea for a multi-view scenario based on the following algorithm:

Fig. 3: Left: an example of the two dimensional function $S(\sigma_l, \sigma_m)$. Right: a slice at the
first row ($\sigma_2 = 10^{-5}$). The asymptotes are clearly visible in both figures. Algorithm 1 exploits
the multi-view to set a small scale parameter for both views.


Algorithm 1 Multi-view kernel bandwidth selection

Input: Multiple sets of observations (views) $X^l$, $l = 1, \ldots, L$.

Output: Scale parameters for all views $\{\sigma_1, \ldots, \sigma_L\}$.

1: Compute Gaussian kernels $K^l(\sigma_l)$, $l = 1, \ldots, L$, for several values of $\sigma_l$.

2: Compute for all pairs $l \neq m$: $S^{lm}(\sigma_l, \sigma_m) = \log\left(\sum_i \sum_j K^{lm}_{i,j}(\sigma_l, \sigma_m)\right)$, where $K^{lm}(\sigma_l, \sigma_m) = K^l(\sigma_l) \cdot K^m(\sigma_m)$.

3: for $l = 1 : L$ do

4: Find the minimal value for $\sigma_l$ such that $S^{lm}(\sigma_l, \sigma_m)$ is linear for all $m \neq l$.

5: end for


Note that the two dimensional function $S^{lm}(\sigma_l, \sigma_m)$ has two asymptotes:
$S^{lm}(\sigma_l, \sigma_m) \to \log(M)$ as $\sigma_l, \sigma_m \to 0$, and
$S^{lm}(\sigma_l, \sigma_m) \to \log(M^3) = 3\log(M)$ as $\sigma_l, \sigma_m \to \infty$, where $M$ is the
number of data points. This holds since for $\sigma_l, \sigma_m \to 0$ both $K^l$ and $K^m$ approach the
identity matrix, and for $\sigma_l, \sigma_m \to \infty$ both $K^l$ and $K^m$ approach all-ones matrices.
An example of the plot of $S^{lm}(\sigma_l, \sigma_m)$ for two views (L = 2) is presented in Fig. 3.
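
The function $S^{lm}$ used in step 2 of Algorithm 1 and its two asymptotes can be illustrated with the sketch below. The name `log_kernel_sum` is an assumption; the sketch only evaluates $S^{lm}$ at given scales and does not implement the search for the linear region in step 4.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def log_kernel_sum(X_l, X_m, sigma_l, sigma_m):
    """S^{lm}(sigma_l, sigma_m): log of the sum of entries of K^l K^m."""
    K_lm = gaussian_kernel(X_l, sigma_l) @ gaussian_kernel(X_m, sigma_m)
    return np.log(np.sum(K_lm))
```

Evaluating `log_kernel_sum` on a logarithmic grid of scales gives a surface of the kind described for Fig. 3: it is flat near $\log(M)$ for very small scales, flat near $3\log(M)$ for very large scales, and increasing in between.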

IV. Coupled views L = 2 

In this section, we provide analytical results for the simple case of a coupled data set (i.e. L = 2). Some
of the results could be extended to a larger number of views, but not in a straightforward manner. To
simplify the notation in the rest of this section, we denote $X = X^1$ and $Y = X^2$.


A. Coupled mapping 

The mappings provided by our approach (Eq. (13)) are justified by the relations given by Eq. (12).
In this section, we provide another analytic justification for the proposed mapping. We begin with an
analysis of a 1-dimensional mapping for each view. Let $\rho(x) = (\rho(x_1), \rho(x_2), \ldots, \rho(x_M))$
and $\rho(y) = (\rho(y_1), \rho(y_2), \ldots, \rho(y_M))$ denote such mappings (one for each view), and let
$\hat{\rho} = (\rho(x), \rho(y))$ and $\rho_i = (\rho(x_i), \rho(y_i))$. Define $K^z = K^x \cdot K^y$,
where $K^x, K^y$ are computed based on Eq. (4). Our mapping should preserve local connectivities;
therefore, we want to ensure that if the data points $i$ and $j$ are close in both views, then $\rho_i$
and $\rho_j$ will be close. Minimization of the objective function

$$\arg\min_{\hat{\rho}} \sum_{i,j} \left[(\rho(x_i) - \rho(y_j))^2 K^z_{i,j} + (\rho(y_i) - \rho(x_j))^2 (K^z)^T_{i,j}\right] \qquad (17)$$

with additional constraints provides such a connectivity preserving mapping. If $K^z_{i,j}$ is small,
indicating a low connectivity between data points $i$ and $j$, the distance between the mapped points can
be large. On the other hand, if $K^z_{i,j}$ is large, indicating a high connectivity between points $i$
and $j$, the distance between the mapped points will be small in order to minimize the objective function.

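Theorem 2. Setting $\hat{\rho} = \psi_1$ minimizes the objective function in Eq. (17), where $\psi_1$ is
the second eigenvector of the eigenvalue problem $P \psi_i = \lambda_i \psi_i$.

Proof. Let $D^{rows}_{i,i} = \sum_j K^z_{i,j}$ and $D^{cols}_{j,j} = \sum_i K^z_{i,j}$. Since
$(K^z)^T_{i,j} = K^z_{j,i}$, the two terms of the objective are equal after exchanging the roles of $i$
and $j$, so that

$$\sum_{i,j} \left[(\rho(x_i) - \rho(y_j))^2 K^z_{i,j} + (\rho(y_i) - \rho(x_j))^2 (K^z)^T_{i,j}\right] = 2 \sum_{i,j} (\rho(x_i) - \rho(y_j))^2 K^z_{i,j}$$

$$= 2 \left[\sum_i \rho(x_i)^2 D^{rows}_{i,i} + \sum_j \rho(y_j)^2 D^{cols}_{j,j} - 2 \sum_{i,j} \rho(x_i) \rho(y_j) K^z_{i,j}\right]$$

$$= 2\, [\rho(x) \;\; \rho(y)] \left( \begin{bmatrix} D^{rows} & 0_{M \times M} \\ 0_{M \times M} & D^{cols} \end{bmatrix} - \begin{bmatrix} 0_{M \times M} & K^z \\ (K^z)^T & 0_{M \times M} \end{bmatrix} \right) \begin{bmatrix} \rho(x)^T \\ \rho(y)^T \end{bmatrix} = 2\, \hat{\rho} (D - K) \hat{\rho}^T.$$

By adding a scaling constraint, the minimization problem is rewritten as

$$\arg\min_{\hat{\rho} D \hat{\rho}^T = 1} \hat{\rho} (D - K) \hat{\rho}^T. \qquad (18)$$

This minimization problem can be solved by finding the minimal eigenvalue of
$(D - K) \hat{\rho}^T = \lambda D \hat{\rho}^T$, since the minimization term is
$\lambda \hat{\rho} D \hat{\rho}^T = \lambda$. This eigenproblem has a trivial solution, which is an
eigenvector of all ones (denoted as $\mathbf{1}$) with $\lambda = 0$. The constraint
$\hat{\rho} D \mathbf{1} = 0$ is added to remove the trivial solution. The solution is given by the
smallest non-zero eigenvalue. Multiplying the eigenproblem of Eq. (18) by $D^{-1}$ reduces the problem to
$P \hat{\rho}^T = (1 - \lambda) \hat{\rho}^T$. Thus, we are looking for the eigenvector of $P$ which
corresponds to the second largest eigenvalue. □

Theorem 2 provides yet another justification for using our proposed mapping Eq. (13).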


B. Spectral decomposition 

In this section, we show how to efficiently compute the spectral decomposition of $P$ (Eq. (6)) when
only two views exist (L = 2). The matrix $P$ is algebraically similar to the symmetric matrix
$P_s = D^{1/2} P D^{-1/2} = D^{-1/2} K D^{-1/2}$. Therefore, both $P$ and $P_s$ share the same set of
eigenvalues $\{\lambda_m\}$. Due to the symmetry of the matrix $P_s$, it has a set of $2M$ real eigenvalues
$\{\lambda_m\}_{m=0}^{2M-1} \in \mathbb{R}$ and corresponding real orthogonal eigenvectors
$\{\pi_m\}_{m=0}^{2M-1} \in \mathbb{R}^{2M}$; thus, $P_s = \Pi \Lambda \Pi^T$. By denoting
$\Psi = D^{-1/2} \Pi$ and $\Phi = D^{1/2} \Pi$, we conclude that the set
$\{\psi_m, \phi_m\}_{m=0}^{2M-1} \in \mathbb{R}^{2M}$ denotes the right and the left eigenvectors of
$P = \Psi \Lambda \Phi^T$, respectively, satisfying $\psi_i^T \phi_j = \delta_{i,j}$. In the sequel, we use
the symmetric matrix $P_s$ to simplify the analysis.

To avoid the spectral decomposition of the $2M \times 2M$ matrix $P_s$, its spectral decomposition can be
computed using the Singular Value Decomposition (SVD) of the normalized matrix
$\bar{K}^z = (D^{rows})^{-1/2} K^z (D^{cols})^{-1/2}$ of size $M \times M$, where $K^z = K^x \cdot K^y$ and
$D^{rows}_{i,i} = \sum_{j=1}^{M} K^z_{i,j}$ and $D^{cols}_{j,j} = \sum_{i=1}^{M} K^z_{i,j}$ are diagonal
matrices. Theorem 3 enables us to form the eigenvectors of $P$ as a concatenation of the singular vectors
of $\bar{K}^z$.

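Theorem 3. By using the left and right singular vectors of the normalized matrix
$\bar{K}^z = (D^{rows})^{-1/2} K^z (D^{cols})^{-1/2} = V \Sigma U^T$, the eigenvectors and the eigenvalues
of $P_s = D^{-1/2} K D^{-1/2}$ are computed explicitly by

$$\Pi = \frac{1}{\sqrt{2}} \begin{bmatrix} V & V \\ U & -U \end{bmatrix}, \quad \Lambda = \begin{bmatrix} \Sigma & 0_{M \times M} \\ 0_{M \times M} & -\Sigma \end{bmatrix}. \qquad (19)$$

Proof. Both $V$ and $U$ are orthonormal sets; therefore, $u_i^T u_j = \delta_{i,j}$ and
$v_i^T v_j = \delta_{i,j}$. Thus, the set $\{\pi_m\}$ is orthonormal and $\Pi \Pi^T = I$. By using the
construction defined in Eq. (19), $\Pi \Lambda \Pi^T$ is computed as

$$\Pi \Lambda \Pi^T = \frac{1}{2} \begin{bmatrix} V & V \\ U & -U \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & -\Sigma \end{bmatrix} \begin{bmatrix} V^T & U^T \\ V^T & -U^T \end{bmatrix} = \frac{1}{2} \begin{bmatrix} V\Sigma & -V\Sigma \\ U\Sigma & U\Sigma \end{bmatrix} \begin{bmatrix} V^T & U^T \\ V^T & -U^T \end{bmatrix}$$

$$= \frac{1}{2} \begin{bmatrix} 0 & 2\bar{K}^z \\ 2(\bar{K}^z)^T & 0 \end{bmatrix} = D^{-1/2} K D^{-1/2} = P_s.$$

The 0 denotes the $M \times M$ matrix of zeros. □

Thus, the proposed mapping in Eq. (13) can be computed for L = 2 using the SVD of $\bar{K}^z$, Eq. (19)
and $\Psi = D^{-1/2} \Pi$.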
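
For two views, the construction in Theorem 3 can be sketched as follows. The function name and its arguments are assumptions for illustration; note that `numpy.linalg.svd` returns the factors in the order $(V, \Sigma, U^T)$ in the notation of Eq. (19).

```python
import numpy as np

def gaussian_kernel(X, sigma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def coupled_spectrum_via_svd(X, Y, sigma_x, sigma_y):
    """Eigenpairs of P_s (and right eigenvectors of P) from an M x M SVD."""
    Kz = gaussian_kernel(X, sigma_x) @ gaussian_kernel(Y, sigma_y)
    d_rows, d_cols = Kz.sum(axis=1), Kz.sum(axis=0)
    # Normalised matrix D_rows^{-1/2} Kz D_cols^{-1/2}.
    Kz_bar = Kz / np.sqrt(np.outer(d_rows, d_cols))
    V, s, Ut = np.linalg.svd(Kz_bar)
    U = Ut.T
    Pi = np.block([[V, V], [U, -U]]) / np.sqrt(2.0)   # Eq. (19)
    lam = np.concatenate([s, -s])
    D = np.concatenate([d_rows, d_cols])
    Psi = Pi / np.sqrt(D)[:, None]                    # Psi = D^{-1/2} Pi
    return lam, Pi, Psi
```

The decomposition works on an $M \times M$ matrix instead of a $2M \times 2M$ one, and the eigenvalues come in $\pm$ pairs as a consequence of the block structure of $K$ for L = 2.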


C. Cross view diffusion distance 

In some physical systems, the observed dataset denoted by $X$ changes over some underlying parameter
denoted by $\alpha$. Under this model, we can obtain multiple snapshots for various values of $\alpha$.
Each snapshot is denoted by $X^\alpha$. If these datasets are high dimensional, quantifying the amount of
change in the datasets is a difficult task. This scenario was recently studied in [29], which generalizes
the diffusion framework for cases in which the data changes over the parameter $\alpha$. An example of such
a scenario occurs in hyper-spectral images that change over time. The DM framework is applied in [29] to a
fixed value of $\alpha$. Then, by using the extracted low dimensional mapping, the Euclidean distance
enables quantifying the amount of change over $\alpha$. This approach is sensitive, since every small
change in the data can result in different mappings and the mappings are extracted independently. Thus,
there is no mutual influence on the extracted mapping. Our approach incorporates the mutual relations of
data within the view and the relations among views. This observation enables us to measure in a more
robust way the amount of variation between two datasets that correspond to a small variation in $\alpha$.
We now define a new diffusion distance. This distance measures the relation between two views, i.e.
between all the data points at different values of $\alpha$. We measure the distance between all the
coupled data points among the mappings of the snapshots $X^{\alpha_l}$ and $X^{\alpha_m}$ by using the
expression

$$\left(\mathcal{D}_t^{(CV)}(X^{\alpha_l}, X^{\alpha_m})\right)^2 = \sum_{i=1}^{M} \|\Psi_t(x_i^{\alpha_l}) - \Psi_t(x_i^{\alpha_m})\|^2. \qquad (20)$$

Our kernel matrix is a product of the Gaussian kernel matrices in each view. If the values of the kernel
matrices $K^{\alpha_l}$ and $K^{\alpha_m}$ are similar, this corresponds to similarity between the inner
geometries of the views. The right and left singular vectors of the matrix $K^{\alpha_l} K^{\alpha_m}$
will then be similar; thus, $\mathcal{D}_t^{(CV)}$ will be small.

Theorem 4. The cross manifold distance (defined in Eq. (20)) is invariant to orthonormal transformations
between the ambient spaces $X^{\alpha_l}$ and $X^{\alpha_m}$.

Proof. Denote an orthonormal transformation matrix $R : X^{\alpha_l} \to X^{\alpha_m}$, w.l.o.g. by $x_i^{\alpha_m} = R x_i^{\alpha_l}$. Then

$$K^{\alpha_m}_{i,j} = \exp\left\{ -\frac{\|x_i^{\alpha_m} - x_j^{\alpha_m}\|^2}{2\sigma_m^2} \right\} = \exp\left\{ -\frac{\|R x_i^{\alpha_l} - R x_j^{\alpha_l}\|^2}{2\sigma_m^2} \right\} = \exp\left\{ -\frac{\|x_i^{\alpha_l} - x_j^{\alpha_l}\|^2}{2\sigma_l^2} \right\} = K^{\alpha_l}_{i,j}.$$

The third equality is due to the orthonormality of $R$ and to the choice $\sigma_l = \sigma_m$. Therefore, the matrix $K^z = (K^{\alpha_l})^2$
from Eq. (4) is symmetric and its right and left singular vectors are equal, i.e. $U = V$, Eq. (19). This
induces a repetitive form in $\Psi = D^{-1/2}\Pi$, namely $\psi_l[i] = \psi_l[M + i]$, $1 \le i, l \le M - 1$;
thus, $[\mathcal{D}_t^{(CV)}(X^{\alpha_l}, X^{\alpha_m})]^2 = 0$.

□
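The key step of this proof, namely that the Gaussian kernel is unchanged by an orthonormal transformation when both views use the same bandwidth, can be illustrated with a short sketch. The helper `gaussian_kernel`, the random data and the use of a QR factorization to draw a random orthonormal matrix are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Pairwise Gaussian kernel exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                    # first snapshot
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # random orthonormal matrix
Y = X @ R.T                                     # second snapshot, y_i = R x_i

K_x = gaussian_kernel(X, sigma=1.0)
K_y = gaussian_kernel(Y, sigma=1.0)             # same bandwidth in both views
max_diff = np.abs(K_x - K_y).max()
Kz = K_x @ K_y                                  # equals (K_x)^2 up to rounding
print(max_diff)
```

Since the two kernels coincide, their product is symmetric, which is the property the proof relies on.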
D. Spectral decay of K

The power of kernel based methods for dimensionality reduction stems from the spectral decay of
the kernel's eigenvalues. In this section, we study the relation between the spectral decay of the Kernel
Product (Eq. (7)) and our multi-view kernel (Eq. (6)). In Section VI-A1 we evaluate the spectral decay
empirically using two experiments. The rest of this section is devoted to the theoretical justification for
the spectral decay of our proposed framework. We start with some background.

Theorem 5. The eigenvalues of $P$ (Eq. (6)) are real and bounded, where

$$|\lambda_i| \le 1, \quad i = 1, \ldots, 2M.$$

A similar proof is given in 


Proof. As shown in Section IV-B, $P$ is algebraically similar to a symmetric matrix, thus, its eigenvalues
are guaranteed to be real. Denote by $\lambda$ and $\psi$ the eigenvalue and the eigenvector, respectively, such that
$\lambda \psi = P \psi$. Define $i_0 = \arg\max_{1 \le i \le 2M} |\psi[i]|$ to be the index of the largest entry in $\psi$. The maximal value
$\psi[i_0]$ can be computed using $P$ from Eq. (6) such that $\lambda \psi[i_0] = \sum_{j=1}^{2M} P_{i_0,j} \psi[j]$. Therefore,

$$|\lambda| = \left| \sum_{j=1}^{2M} P_{i_0,j} \frac{\psi[j]}{\psi[i_0]} \right| \le \sum_{j=1}^{2M} P_{i_0,j} \left| \frac{\psi[j]}{\psi[i_0]} \right| \le \sum_{j=1}^{2M} P_{i_0,j} = 1.$$

The first inequality is due to the triangle inequality and the last equality
is due to the kernel normalization by $D^{-1}$.

□
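The bound can be observed on a small example. The sketch below is an illustration under stated assumptions: the helper names, the random two-view data and the unit bandwidths are hypothetical. It forms the block kernel with off-diagonal blocks $K^x K^y$ and $K^y K^x$, and computes the spectrum through the symmetric conjugate $D^{-1/2} K D^{-1/2}$, which shares its eigenvalues with $P = D^{-1}K$.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def multiview_spectrum(Kx, Ky):
    """Eigenvalues of P = D^{-1} K via the symmetric matrix D^{-1/2} K D^{-1/2}."""
    M = Kx.shape[0]
    Z = np.zeros((M, M))
    K = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])   # cross-view block kernel
    d = K.sum(axis=1)                            # row sums (diagonal of D)
    Ps = K / np.sqrt(np.outer(d, d))             # symmetric conjugate of P
    return np.linalg.eigvalsh(Ps)                # real by construction

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 2))
Y = rng.normal(size=(15, 2))
eigvals = multiview_spectrum(gaussian_kernel(X, 1.0), gaussian_kernel(Y, 1.0))
print(eigvals.min(), eigvals.max())
```

Because of the block structure the spectrum is symmetric around zero, so both $1$ and $-1$ are attained.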


Theorem 5 shows that the eigenvalues are bounded. However, bounded eigenvalues are insufficient
for dimensionality reduction. Dimensionality reduction is meaningful when there is a significant spectral
decay.

Definition 1. Let $\mathcal{M}$ be a manifold. The intrinsic dimension $d$ of the manifold is a positive integer
determined by how many independent "coordinates" are needed to describe $\mathcal{M}$. Using a parametrization
to describe a manifold, the dimension of $\mathcal{M}$ is the smallest integer $d$ such that a smooth map $f(\xi) = \mathcal{M}$,
$\xi \in \mathbb{R}^d$, describes the manifold.


Our framework is based on a Gaussian kernel. The spectral decay of Gaussian kernels was studied in
[31]. We use Lemma 2 to evaluate the spectral decay of our kernel.


Lemma 2. Assume that the data is sampled from a manifold with intrinsic dimension $d \ll M$. Let $K^\circ$
denote the kernel with an exponential decay as a function of the Euclidean distance. For
$\delta > 0$, the number of eigenvalues of $K^\circ$ above $\delta$ is proportional to $(\log(1/\delta))^d$.


Lemma 2 is based on Weyl's asymptotic law [30]. Let
$r_\delta = r(\delta) = \max\{\ell \in \mathbb{N} \text{ such that } |\lambda_\ell| \ge \delta\}$ denote the number of eigenvalues of $K^\circ$ above $\delta$.
$K^\circ_{i,j} = K^x_{i,j} \cdot K^y_{i,j}$ corresponds to a single DM view given in 0. Theorem 6 relates the spectral decay
of the kernel $P$ (Eq. (6)) to the decay of the Kernel Product-based DM ($P^\circ$, Eq. (7) and in 0).

Lemma 3. Let $A, B \in \mathbb{R}^{M \times M}$ be such that $A, B \ge 0$. Then for any
$1 \le k \le M - 1$

$$\prod_{\ell=k}^{M-1} \lambda_\ell(A \cdot B) \le \prod_{\ell=k}^{M-1} \lambda_\ell(A \circ B), \tag{21}$$

where $\circ$ is the element-wise (Hadamard) matrix product.
This inequality is proved in [34] and [35].

Theorem 6. The product of the last $M - 1 - r_\delta$ eigenvalues of $K^z$ is smaller than $\delta^{M-1-r_\delta}$. Formally,

$$\prod_{\ell=r_\delta}^{M-1} \lambda_\ell(K^x \cdot K^y) \le \delta^{M-1-r_\delta}.$$

Proof. Denote by $\{\lambda_\ell(A)\}_{\ell=0}^{M-1}$ the eigenvalues of the matrix $A$. They are enumerated in descending order
such that $\lambda_0(A) \ge \lambda_1(A) \ge \ldots \ge \lambda_{M-1}(A)$. We use Lemma 3 to prove Theorem 6 by choosing $A = K^x$
and $B = K^y$, which are positive semi-definite. $K^x \circ K^y = K^\circ$ corresponds to the approach in Q. By
using Lemma 3 and choosing $k = r_\delta$ in Eq. (21) we get

$$\prod_{\ell=r_\delta}^{M-1} \lambda_\ell(K^x \cdot K^y) \le \prod_{\ell=r_\delta}^{M-1} \lambda_\ell(K^\circ) \le \delta^{M-1-r_\delta}.$$
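The inequality of Lemma 3, on which this proof rests, can be probed numerically. The sketch below is only an illustration: it uses randomly generated symmetric positive definite matrices (an assumption made to keep all logarithms finite) rather than the paper's kernels, and compares the logarithms of the tail products for every $k$.

```python
import numpy as np

def tail_log_products(eigs):
    """Entry k holds log(prod_{l=k}^{M-1} lambda_l), eigenvalues in descending order."""
    lam = np.sort(eigs)[::-1]
    return np.cumsum(np.log(lam[::-1]))[::-1]

rng = np.random.default_rng(3)
M = 6
A0 = rng.normal(size=(M, M))
B0 = rng.normal(size=(M, M))
A = A0 @ A0.T + np.eye(M)                      # symmetric positive definite
B = B0 @ B0.T + np.eye(M)

eig_prod = np.linalg.eigvals(A @ B).real       # ordinary matrix product A . B
eig_had = np.linalg.eigvalsh(A * B)            # element-wise product A o B
lhs = tail_log_products(eig_prod)
rhs = tail_log_products(eig_had)
print(np.all(lhs <= rhs + 1e-9))
```

Working with logarithms avoids underflow when many small eigenvalues are multiplied, which is the regime of interest for kernels with fast spectral decay.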


Using the kernel matrix spectral decay, we can approximate Eq. (12) by neglecting all the eigenvalues
that are smaller than $\delta$. Thus, we can compute a low dimensional mapping such that

$$\Psi_t^r(x_i) : x_i \mapsto [\lambda_1^t \psi_1[i], \lambda_2^t \psi_2[i], \lambda_3^t \psi_3[i], \ldots, \lambda_{r-1}^t \psi_{r-1}[i]]^T \in \mathbb{R}^{r-1}. \tag{22}$$

This mapping of dimension $r$ provides a low dimensional space which improves the performance and the
efficiency of various machine learning tasks. The following lemma introduces an error bound for using the
low dimensional mapping $\Psi_t^r(x_i)$.
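A truncation of this kind can be sketched in a few lines. The function name `truncated_mapping`, the default threshold and the toy eigenpairs are hypothetical. The sketch assumes the eigenvalues are real, skips the trivial leading coordinate, and keeps only coordinates whose eigenvalue is at least $\delta$.

```python
import numpy as np

def truncated_mapping(eigvals, eigvecs, t=1, delta=1e-3):
    """Rows are the truncated diffusion coordinates lambda_s^t * psi_s[i]."""
    order = np.argsort(eigvals)[::-1]               # descending eigenvalues
    lam, psi = eigvals[order], eigvecs[:, order]
    keep = np.flatnonzero(lam[1:] >= delta) + 1     # skip the trivial s = 0
    return psi[:, keep] * lam[keep] ** t

# Toy eigenpairs chosen only to make the truncation visible.
eigvals = np.array([1.0, 0.6, 0.2, 1e-4])
eigvecs = np.eye(4)
mapping = truncated_mapping(eigvals, eigvecs, t=1, delta=1e-3)
print(mapping.shape)
```

In this toy case the last coordinate falls below the threshold and is dropped, so each point is represented by two coordinates.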


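Lemma 4. The truncated diffusion distance up to coordinate $r$, defined as

$$[\mathcal{D}_t^r(x_i, x_j)]^2 = \| \Psi_t^r(x_i) - \Psi_t^r(x_j) \|^2 = \sum_{s=1}^{r} \lambda_s^{2t} (\psi_s[i] - \psi_s[j])^2,$$

is bounded by the inner view diffusion distance (defined in Eq. (10))

$$2 \cdot \Big[ \sum_{s=1}^{M-1} \lambda_s^{2t} (\psi_s[i] - \psi_s[j])^2 - \delta^{2t} \cdot \frac{1 - \delta_{i,j}}{\tilde{D}} \Big] \le [\mathcal{D}_t^r(x_i, x_j)]^2 \le 2 \cdot \sum_{s=1}^{M-1} \lambda_s^{2t} (\psi_s[i] - \psi_s[j])^2, \tag{23}$$

where $\delta_{i,j}$ is the Kronecker delta function.

Proof. For the right inequality, clearly

$$[\mathcal{D}_t^r(x_i, x_j)]^2 \le [\mathcal{D}_t(x_i, x_j)]^2 = 2 \cdot \sum_{s=1}^{M-1} \lambda_s^{2t} (\psi_s[i] - \psi_s[j])^2,$$

where the equality is a result of the repetitive form of $\Psi$, which was defined in Theorem 3 for $L = 2$. Note that
$s = 0$ was excluded from the sum as $\psi_0 = \mathbf{1}$ is constant. For the left inequality, using that $\Psi = D^{-1/2}\Pi$,
where $\Pi$ is an orthonormal basis defined in Eq. (19), by the orthogonality of $\Pi$ we get that

$$\Psi \Psi^T = D^{-1/2} \Pi \Pi^T D^{-1/2} = D^{-1},$$

which means that

$$\sum_{s=0}^{2M-1} (\psi_s[i] - \psi_s[j])^2 = \frac{1}{D_{i,i}} + \frac{1}{D_{j,j}} - \frac{2\delta_{i,j}}{D_{i,i}}.$$

By the definition of the truncated diffusion distance we have

$$[\mathcal{D}_t^r(x_i, x_j)]^2 = \sum_{s=0}^{2M-1} \lambda_s^{2t} (\psi_s[i] - \psi_s[j])^2 - \sum_{s=r+1}^{2M-1} \lambda_s^{2t} (\psi_s[i] - \psi_s[j])^2$$

$$\ge 2 \cdot \sum_{s=1}^{M-1} \lambda_s^{2t} (\psi_s[i] - \psi_s[j])^2 - \delta^{2t} \sum_{s=0}^{2M-1} (\psi_s[i] - \psi_s[j])^2 \ge 2 \cdot \Big[ \sum_{s=1}^{M-1} \lambda_s^{2t} (\psi_s[i] - \psi_s[j])^2 - \delta^{2t} \cdot \frac{1 - \delta_{i,j}}{\tilde{D}} \Big], \tag{24}$$

where $\tilde{D}$ is the minimal value of $D_{i,i}$ and $D_{j,j}$. In the same way, a bound for the truncated diffusion
distance between $y_i$ and $y_j$ is derived.

□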


E. Out-of-sample extension 

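To extend the diffusion coordinates to new data points without re-applying a large-scale eigendecomposition
[31], the Nyström extension is widely used. Here we formulate the extension method for a
multi-view scenario. Given the data sets $X$ and $Y$ and new points $\tilde{x} \notin X$ and $\tilde{y} \notin Y$, we want to
extend the multi-view diffusion mapping to $\tilde{x}$ and $\tilde{y}$ without re-applying the proposed framework. First,
we describe the explicit form for the eigenvalue problem for two views $X$ and $Y$. The eigenvector $\psi_k$
with the corresponding eigenvalue $\lambda_k$ satisfies $\lambda_k \psi_k = P \psi_k$. By using the definition of $P$ from Eqs. (5)
and (6) we get that

$$\lambda_k \psi_k = D^{-1} \begin{bmatrix} 0_{M \times M} & K^x \cdot K^y \\ K^y \cdot K^x & 0_{M \times M} \end{bmatrix} \psi_k,$$

and due to the block form of the matrix $P$

$$\lambda_k \psi_k^x[i] = \sum_j p(x_i, y_j) \psi_k^y[j], \qquad \lambda_k \psi_k^y[i] = \sum_j p(y_i, x_j) \psi_k^x[j],$$

where $\psi_k^x[i] = \psi_k[i]$, $i = 1, \ldots, M$ and $\psi_k^y[i] = \psi_k[i + M]$, $i = 1, \ldots, M$. The transition matrices are

$$p(x_i, y_j) = \frac{\sum_s K^x_{i,s} K^y_{s,j}}{D^{rows}_{i,i}} \quad \text{and} \quad p(y_i, x_j) = \frac{\sum_s K^y_{i,s} K^x_{s,j}}{D^{cols}_{i,i}}.$$

The Nyström extension is an approximated weighted sum of the original eigenvectors. The weights are
computed by applying the kernel $P$ to the extended data points. For the proposed mapping, the extension
is defined by

$$\hat{\psi}_k(\tilde{x}) = \frac{1}{\lambda_k} \sum_j p(\tilde{x}, y_j) \psi_k[j + M] = \frac{1}{\lambda_k \tilde{D}(\tilde{x})} \sum_j \sum_s \exp\left\{ -\frac{\|\tilde{x} - x_s\|^2}{2\sigma_x^2} \right\} \exp\left\{ -\frac{\|y_s - y_j\|^2}{2\sigma_y^2} \right\} \psi_k[j + M], \tag{25}$$

$$\hat{\psi}_k(\tilde{y}) = \frac{1}{\lambda_k} \sum_j p(\tilde{y}, x_j) \psi_k[j] = \frac{1}{\lambda_k \tilde{D}(\tilde{y})} \sum_j \sum_s \exp\left\{ -\frac{\|\tilde{y} - y_s\|^2}{2\sigma_y^2} \right\} \exp\left\{ -\frac{\|x_s - x_j\|^2}{2\sigma_x^2} \right\} \psi_k[j], \tag{26}$$

where $\tilde{D}(\tilde{x})$ and $\tilde{D}(\tilde{y})$ denote the sums over $j$ and $s$ of the corresponding kernel products, i.e. the row-sum
normalization for the new point, and the new mapping vector for the data point $\tilde{x}$ is

$$\hat{\Psi}(\tilde{x}) = [\lambda_1 \hat{\psi}_1(\tilde{x}), \lambda_2 \hat{\psi}_2(\tilde{x}), \lambda_3 \hat{\psi}_3(\tilde{x}), \ldots, \lambda_{M-1} \hat{\psi}_{M-1}(\tilde{x})] \in \mathbb{R}^{M-1}. \tag{27}$$

The new coordinates in the diffusion space are approximated and the new data points $\tilde{x}$, $\tilde{y}$ have no effect
on the original map's structure.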
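The extension of a new point in view $X$ can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the synthetic data and the row-sum normalization of the new point are assumptions. A useful sanity check is that extending a training point reproduces its original coordinate, because the extension formula then coincides with the eigenvalue equation.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def multiview_eig(X, Y, sx, sy):
    """Eigenpairs of P = D^{-1} K computed through the symmetric conjugate."""
    Kx, Ky = gaussian_kernel(X, X, sx), gaussian_kernel(Y, Y, sy)
    M = len(X)
    Z = np.zeros((M, M))
    K = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    d = K.sum(axis=1)
    lam, Pi = np.linalg.eigh(K / np.sqrt(np.outer(d, d)))
    order = np.argsort(lam)[::-1]
    Psi = Pi[:, order] / np.sqrt(d)[:, None]        # Psi = D^{-1/2} Pi
    return lam[order], Psi, Ky

def nystrom_extend_x(x_new, X, Ky, lam, Psi, sx, k):
    """Approximate psi_k at a new point of view X, in the spirit of Eq. (25)."""
    M = len(X)
    kx = gaussian_kernel(x_new[None, :], X, sx)[0]  # k^x(x_new, x_s)
    w = kx @ Ky                                     # sum_s k^x(x_new, x_s) K^y[s, j]
    p = w / w.sum()                                 # transition probabilities to y_j
    return p @ Psi[M:, k] / lam[k]

rng = np.random.default_rng(4)
M = 25
X = rng.normal(size=(M, 2))
Y = X + 0.1 * rng.normal(size=(M, 2))
lam, Psi, Ky = multiview_eig(X, Y, 1.0, 1.0)
k = 1
ext = nystrom_extend_x(X[3], X, Ky, lam, Psi, 1.0, k)
print(ext, Psi[3, k])
```

For a genuinely new point the result is an approximation whose quality degrades as $\lambda_k$ becomes small, so in practice only the leading coordinates would be extended.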
F. Infinitesimal generator

A family of diffusion operators was introduced in [Q. Each operator differs by the normalization
applied to it. If appropriate limits are taken such that $M \to \infty$, $\epsilon \to 0$ and $\epsilon = 2\sigma^2$, then from m
it follows that the DM kernel operator will converge to one of the following differential operators: 1.
Normalized graph Laplacian. 2. Laplace-Beltrami diffusion. 3. Heat kernel equation. These are proved in
@. The operators are all special cases of the diffusion equation. This convergence not only provides a
physical justification for the DM framework, but also makes it possible in some cases to distinguish between the geometry
and the density of the data points. In this section, we study the asymptotic properties of the proposed
kernel $K$ (Eq. (5)) by using only two views, i.e. $L = 2$.

We are interested in understanding the properties of the eigenfunctions of the proposed multi-view
kernel $P$ (Eqs. (5), (6)) for two views. We assume that there is some unknown mapping $\beta : \mathbb{R}^d \to \mathbb{R}^d$
from view $X$ to view $Y$ that satisfies $y_i = \beta(x_i)$, $i = 1, \ldots, M$. Each view-specific kernel function has the
same properties $K^x = K^y = K$ such that $K \ge 0$, $K(z) = K(-z)$ and the kernel is normalized such that
$\int_{\mathbb{R}^d} K(z) dz = 1$. Note that with proper normalizations the Gaussian kernel satisfies these requirements.
The analysis is performed for data points $\{x_1, \ldots, x_M\} \in \mathbb{R}^d$ sampled from a uniform distribution over a
bounded domain in $\mathbb{R}^d$. The image of the function $\beta$ is a bounded domain in $\mathbb{R}^d$ with distribution $\alpha(z)$.

Theorem 7. The infinitesimal generator induced by the proposed kernel $P$ (Eq. (6)) converges when
$M \to \infty$, $\epsilon \to 0$, $\epsilon = 2\sigma_x^2 = 2\sigma_y^2$ to a "cross domain Laplacian operator". The convergence is to
functions $f(x)$ and $g(y)$, which are the eigenfunctions of $P$. These functions are the solutions of the
following diffusion-like equations:

$$(Pf)(x_i) = g(\beta(x_i)) + \epsilon \Delta\gamma(\beta(x_i)) / \alpha(\beta(x_i)) + O(\epsilon^{3/2}), \tag{28}$$

$$(Pg)(y_i) = f(\beta^{-1}(y_i)) + \epsilon \Delta\eta(\beta^{-1}(y_i)) / \alpha(\beta(y_i)) + O(\epsilon^{3/2}), \tag{29}$$

where the functions $\gamma, \eta$ are defined as $\gamma(z) = g(z)\alpha(z)$, $\eta(z) = f(z)\alpha(z)$.

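Proof. The eigenfunction of the operator $L$ is defined using the functions $f(x)$ and $g(y)$ by concatenating
the vectors such that

$$h = [f(x_1), f(x_2), \ldots, f(x_M), g(y_1), g(y_2), \ldots, g(y_M)] \in \mathbb{R}^{2M}.$$

By expanding the single view construction presented in [15], [33], the limit of the characteristic equation
is

$$\lim_{\substack{M \to \infty \\ \epsilon \to 0}} (Lh)_i = \lim_{\substack{M \to \infty \\ \epsilon \to 0}} \frac{\sum_{j=1}^{2M} K_{i,j} h_j}{\sum_{j=1}^{2M} K_{i,j}} = \lim_{\substack{M \to \infty \\ \epsilon \to 0}} \frac{\sum_{j=1}^{M} \sum_{\ell=1}^{M} K^x_{i,\ell} K^y_{\ell,j} \, g(y_j)}{\sum_{j=1}^{M} \sum_{\ell=1}^{M} K^x_{i,\ell} K^y_{\ell,j}}, \quad i = 1, \ldots, M. \tag{30}$$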


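We approximate the summation based on a Riemann integral. Thus, the denominator becomes

$$\frac{1}{M^2 \epsilon^d} \sum_{j=1}^{M} \sum_{\ell=1}^{M} K\Big(\frac{x_i - x_\ell}{\sqrt{\epsilon}}\Big) K\Big(\frac{y_\ell - y_j}{\sqrt{\epsilon}}\Big) \xrightarrow[\epsilon \to 0]{M \to \infty} \frac{1}{\epsilon^d} \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} K\Big(\frac{x - s}{\sqrt{\epsilon}}\Big) K\Big(\frac{\beta(s) - y}{\sqrt{\epsilon}}\Big) \alpha(y) \, ds \, dy.$$

Using a change of variables $z = \frac{y - \beta(s)}{\sqrt{\epsilon}}$, $y = \beta(s) + \sqrt{\epsilon} z$, $dz = dy / \epsilon^{d/2}$, we get

$$\frac{1}{\epsilon^d} \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} K\Big(\frac{x - s}{\sqrt{\epsilon}}\Big) K\Big(\frac{\beta(s) - y}{\sqrt{\epsilon}}\Big) \alpha(y) \, ds \, dy = \frac{1}{\epsilon^{d/2}} \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} K\Big(\frac{s - x}{\sqrt{\epsilon}}\Big) K(z) \alpha(\beta(s) + \sqrt{\epsilon} z) \, ds \, dz.$$

Using a first order Taylor expansion of $\alpha$ we get

$$\frac{1}{\epsilon^{d/2}} \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} K\Big(\frac{s - x}{\sqrt{\epsilon}}\Big) K(z) \big[ \alpha(\beta(s)) + \sqrt{\epsilon} z^T \nabla\alpha(\beta(s)) + O(\epsilon) \big] \, ds \, dz,$$

and using the symmetry of the kernel $K(z)$ we get

$$\sqrt{\epsilon} \int_{\mathbb{R}^d} K(z) z^T \nabla\alpha(\beta(s)) \, dz = 0.$$

Applying the change of variables $t = \frac{s - x}{\sqrt{\epsilon}}$, $s = \sqrt{\epsilon} t + x$, $dt = ds / \epsilon^{d/2}$, we get

$$\int_{\mathbb{R}^d} K(t) \big[ \alpha(\beta(x + \sqrt{\epsilon} t)) + O(\epsilon) \big] \, dt \approx \alpha(\beta(x)) + O(\epsilon).$$

The last transition is based on a Taylor expansion and zeroing out odd moments of $K(t)$. The Riemann
integral for the numerator from Eq. (30) takes the following form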


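$$\frac{1}{M^2 \epsilon^d} \sum_{j=1}^{M} \sum_{\ell=1}^{M} K\Big(\frac{x_i - x_\ell}{\sqrt{\epsilon}}\Big) K\Big(\frac{y_\ell - y_j}{\sqrt{\epsilon}}\Big) g(y_j) \xrightarrow[\epsilon \to 0]{M \to \infty} \frac{1}{\epsilon^d} \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} K\Big(\frac{x - s}{\sqrt{\epsilon}}\Big) K\Big(\frac{\beta(s) - y}{\sqrt{\epsilon}}\Big) g(y) \alpha(y) \, ds \, dy.$$

By applying a change of variables $z = \frac{y - \beta(s)}{\sqrt{\epsilon}}$, $y = \beta(s) + \sqrt{\epsilon} z$, $dz = dy / \epsilon^{d/2}$, we get

$$\frac{1}{\epsilon^{d/2}} \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} K\Big(\frac{s - x}{\sqrt{\epsilon}}\Big) K(z) \gamma(\beta(s) + \sqrt{\epsilon} z) \, ds \, dz.$$

By using Taylor's expansion of $\gamma$ around $\beta(s)$ we get the integral, denoted by (L),

$$\frac{1}{\epsilon^{d/2}} \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} K\Big(\frac{s - x}{\sqrt{\epsilon}}\Big) K(z) \Big[ \gamma(\beta(s)) + \sqrt{\epsilon} z^T \nabla\gamma(\beta(s)) + \frac{\epsilon}{2} z^T H z + O(\epsilon^{3/2}) \Big] \, ds \, dz,$$

where $H_{i,j} = \frac{\partial^2 \gamma}{\partial z_i \partial z_j}$ is the Hessian. The first term is the integral over $K(z)$, while the second term vanishes,

$$\sqrt{\epsilon} \int_{\mathbb{R}^d} K(z) z^T \nabla\gamma(\beta(s)) \, dz = 0,$$

due to the symmetry of the kernel $K(z)$. The last term becomes $\frac{\epsilon}{2} \Delta\gamma(\beta(s))$.
We substitute the results in (L) to get

$$\frac{1}{\epsilon^{d/2}} \int_{\mathbb{R}^d} K\Big(\frac{s - x}{\sqrt{\epsilon}}\Big) \Big[ \gamma(\beta(s)) + \frac{\epsilon}{2} \Delta\gamma(\beta(s)) + O(\epsilon^{3/2}) \Big] \, ds.$$

By applying a change of variables $t = \frac{s - x}{\sqrt{\epsilon}}$, $s = \sqrt{\epsilon} t + x$, $dt = ds / \epsilon^{d/2}$ and $\gamma(y) = g(y)\alpha(y)$ we get

$$\int_{\mathbb{R}^d} K(t) \Big[ \gamma(\beta(x + \sqrt{\epsilon} t)) + \frac{\epsilon}{2} \Delta\gamma(\beta(x + \sqrt{\epsilon} t)) + O(\epsilon^{3/2}) \Big] \, dt.$$

By using Taylor's expansion again we get

$$\int_{\mathbb{R}^d} K(t) \Big[ \gamma(\beta(x)) + \frac{\epsilon}{2} \Delta\gamma(\beta(x)) + \frac{\epsilon}{2} t^T H t + O(\epsilon^{3/2}) \Big] \, dt.$$

We neglected the terms in which $\epsilon$ is raised to a power higher than 3/2. Terms with odd order of $t$ are
zeroed due to the symmetry of the kernel $K$. Using the same argument as in the integral (L) we get that
the numerator is

$$\gamma(\beta(x)) + \epsilon \Delta\gamma(\beta(x)) + O(\epsilon^{3/2});$$

dividing by the denominator we get

$$(Pf)(x_i) = g(\beta(x_i)) + \epsilon \Delta\gamma(\beta(x_i)) / \alpha(\beta(x_i)) + O(\epsilon^{3/2}).$$

In the same way, we compute the convergence on $g(y_i)$

$$(Pg)(y_i) = f(\beta^{-1}(y_i)) + \epsilon \Delta\eta(\beta^{-1}(y_i)) / \alpha(\beta(y_i)) + O(\epsilon^{3/2}).$$

□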


In the above computation we ignored the following:

• The error due to approximating the sum by an integral. A bound for this error in the single view DM is
introduced in [32].

• The deformation due to the fact that the data is sampled from a non-uniform density. This changes the
result by some constant.

• The data lies on some manifold. This could be dealt with by changing the coordinate system and integrating
on the manifold.

• When assuming that the data lies on some manifold, the Euclidean distance should be replaced by
the geodesic distance along the manifold. As in the analysis of Q, this introduces a factor to the
integral.


G. The convergence rate 

In Theorem 7 we assume that the number of data points $M \to \infty$ while the scale parameter $\epsilon \to 0$. In
practice we cannot expect to have an infinite number of data points. It was shown in Q, [36] and others
that a single view graph Laplacian converges to the Laplacian operator on a manifold. It is demonstrated
in [37], [38] that the variance of the error for such an operator decreases as $M \to \infty$, but increases as
$\epsilon \to 0$. The study in [37] proves that for a uniform distribution of data points, the variance of the error is
bounded by $O\big(\frac{1}{M^{1/2} \epsilon^{1+d/4}}, \epsilon^{1/2}\big)$. This bound was improved in [38] by an asymptotic factor of $\sqrt{\epsilon}$ based
on the correlation between $D^{-1}$ and $K$.

We now turn our attention to the variance of the multi-view kernel for a finite number of points. Given
$x_1, \ldots, x_M$ independent uniformly distributed data points sampled from a bounded domain in $\mathbb{R}^d$, define
the multi-view Parzen window density estimator by

$$K_{M,\epsilon}(x) = \frac{1}{M^2 \epsilon^d} \sum_{\ell=1}^{M} \sum_{j=1}^{M} K\Big(\frac{x - x_\ell}{\sqrt{\epsilon}}\Big) K\Big(\frac{y_\ell - y_j}{\sqrt{\epsilon}}\Big). \tag{31}$$

We are interested in finding a bound for the variance of $K_{M,\epsilon}(x)$ for a finite number of data points.


Var(K M ,e ( x )) = 


1 


M A e 2d 


M -Var 


M 


E a '< 


X — Xu 


)K{ 


Vt ~ Vj 

V~z 


< 


M^Ta ■ M3 ■ Var[K-Ry} < —[Var(Kf) ■ ||A?|U + Var(K>) ■ ||A'fll»] < 

^ ■ »i ■ i + ^ ^ ■ “(*>] ^ 

The constants $m_1$ and $m_2$ are functions of the choice of kernels, and $\alpha$ is a function of the density of the points
$y_i$, $i = 1, \ldots, M$. This bound helps to choose an optimal value for the scaling factor $\epsilon$ given the number
of data points $M$ and the intrinsic dimension $d$.


H. Generalized multi-view kernel

One can consider a more general multi-view kernel that also enables transitions within the views $X$
and $Y$ in each time step. Such a kernel takes the following form

$$K = \begin{bmatrix} (1 - a) \cdot (K^x)^2 & a \cdot K^x K^y \\ a \cdot K^y K^x & (1 - a) \cdot (K^y)^2 \end{bmatrix}, \tag{32}$$

where the parameter $a \in [0, 1]$ provides a bias for the within view transition probability. This kernel is
normalized using the sum of rows diagonal matrix $D$, such that $P = D^{-1} K$. For a small value of $a$,
the kernel favors the within view transition probability and therefore shares similar properties with the single
view diffusion process. For a large value of $a$, the kernel behaves empirically like the multi-view kernel
$K$ (Eq. (5)).
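The effect of the parameter $a$ can be seen in a short sketch. The function name `generalized_multiview_transition` and the random symmetric stand-ins for $K^x$ and $K^y$ are assumptions made for illustration. The sketch assembles the four blocks and row-normalizes the result.

```python
import numpy as np

def generalized_multiview_transition(Kx, Ky, a):
    """Row-stochastic P = D^{-1} K for the generalized kernel with bias a in [0, 1]."""
    K = np.block([
        [(1 - a) * (Kx @ Kx), a * (Kx @ Ky)],
        [a * (Ky @ Kx), (1 - a) * (Ky @ Ky)],
    ])
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
M = 8
Ax, Ay = rng.random((M, M)), rng.random((M, M))
Kx, Ky = Ax + Ax.T, Ay + Ay.T                 # hypothetical symmetric kernels

P_half = generalized_multiview_transition(Kx, Ky, a=0.5)
P_cross = generalized_multiview_transition(Kx, Ky, a=1.0)   # only cross-view hops
P_within = generalized_multiview_transition(Kx, Ky, a=0.0)  # only within-view hops
print(P_half.sum(axis=1))
```

With $a = 1$ the diagonal blocks vanish and the walk can only hop between views, while $a = 0$ decouples the two views entirely.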


V. Related work 

In this section, we describe alternative frameworks that incorporate multiple views. The first two are
related to the diffusion process; however, the proposed kernels are not symmetric, thus they do not guarantee
real eigenvalues and eigenvectors.

A. Unsupervised metric fusion by cross diffusion

The framework in [13] suggests constructing two matrices for each view, $P^1, P^2$ and $\mathcal{P}^1, \mathcal{P}^2$. The first
pair is defined as in Eq. (1). The second pair is a stochastic matrix computed using only the $K_{NN}$ nearest
neighbors for each instance. The kernels are fused such that

$$P^1_{t+1} = \mathcal{P}^1 \cdot P^2_t \cdot (\mathcal{P}^1)^T, \tag{33}$$

$$P^2_{t+1} = \mathcal{P}^2 \cdot P^1_t \cdot (\mathcal{P}^2)^T. \tag{34}$$

The diffusion process incorporates two steps (between the views) in each time unit. A convergence of the
induced diffusion distance is proved for $t \to \infty$ in [13].


B. Common manifold learning using alternating-diffusion 

An alternating diffusion process is proposed in [39]. The construction is based on fusing the stochastic
matrices $P^1$ and $P^2$ by multiplying the matrices such that $P^{AD} = P^1 \cdot P^2$. Assuming that a common
random variable exists in both views, new results regarding the extraction of underlying hidden random
parameters are described in [39]. The study in [39] was inspired by our work and gives a reference to
the current paper.
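The alternating-diffusion fusion amounts to a single matrix product of two row-stochastic matrices. The sketch below is a simplified illustration, not the construction of [39] in full: the Gaussian kernels, the bandwidths and the random data are assumptions.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def row_stochastic(K):
    """Normalize each row of a non-negative kernel to sum to one."""
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(6)
X1 = rng.normal(size=(30, 2))                 # view 1
X2 = rng.normal(size=(30, 3))                 # view 2, same 30 instances
P1 = row_stochastic(gaussian_kernel(X1, 1.0))
P2 = row_stochastic(gaussian_kernel(X2, 1.0))
P_ad = P1 @ P2                                # one step in view 1, then one in view 2
print(P_ad.sum(axis=1)[:3])
```

The product is again row-stochastic, but it is generally not symmetric, which is why real eigenvalues are not guaranteed for this construction.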


C. Kernel Canonical Correlation Analysis (KCCA) 

The frameworks flOll . Ifldll extend the well know Canonical Correlation Analysis (CCA) by applying a 
kernel function prior to the application of CCA. Kernels $K^1$ and $K^2$ are constructed for each view as in
Eq. (4) and the canonical vectors $v_1$ and $v_2$ are computed by solving the following generalized eigenvalue
problem

$$\begin{bmatrix} 0_{M \times M} & K^1 K^2 \\ K^2 K^1 & 0_{M \times M} \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \rho \cdot \begin{bmatrix} (K^1 + \gamma I)^2 & 0_{M \times M} \\ 0_{M \times M} & (K^2 + \gamma I)^2 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}, \tag{35}$$

where $\gamma I$ are regularization terms which guarantee that the matrices $(K^1 + \gamma I)^2$ and $(K^2 + \gamma I)^2$ are
invertible. Usually the Incomplete Cholesky Decomposition (ICD) [40], [41], [42] is used to reduce the
run time required for solving (35). For clustering tasks, K-means is applied to the set of generalized
eigenvectors.

D. Spectral clustering with two views

The approach in [18] generalizes the traditional normalized graph Laplacian for two views. Kernels
$K^1$ and $K^2$ are computed in each view as in Eq. (4). The kernels are multiplied such that $W = K^1 \cdot K^2$
and

$$A = \begin{bmatrix} 0_{M \times M} & W \\ W^T & 0_{M \times M} \end{bmatrix}. \tag{36}$$

Finally, normalization is done by using $D$, where $D_{i,i} = \sum_j A_{i,j}$, such that the normalized fused kernel is
defined as

$$\hat{A} = D^{-0.5} \cdot A \cdot D^{-0.5}. \tag{37}$$


Assuming that the number of clusters in the data is Nc, denoting the eigenvectors of A as (jp, i = 1 ,.... 2 M. 
K-Means algorithm is applied to the mapping 


A 




(38) 


where s[i] = The paper in |[T8l is focused on spectral clustering, however, a similar version 


of the kernel from Eq. (37) is suited for manifold learning, as we demonstrate in this study. We use a
stochastic version of the kernel and extend the construction to multiple views. The stochastic version $P$
(Eq. (6)) is useful as it provides various theoretical justifications for the multi-view diffusion process. In
Section VI this approach is referred to as De Sa's.


VI. Experimental results 

In this section, we present the experimental results which evaluate our framework. We focus the 
experiments on three tasks in machine learning: clustering, classification and manifold learning. 


A. Empirical evaluations of theoretical aspects 


In the first group of experiments we provide empirical evidence that supports the theoretical analysis
from Section IV.

1) Spectral decay: In Section IV-D an upper bound on the eigenvalues' decay rate for our multi-view-based
approach (matrix $P$, Eq. (6)) is presented. In order to empirically evaluate the decay rate, synthetic
datasets are generated and the decay is compared with that of other approaches. To evaluate the spectral decay
of $P$ (Eq. (6)), $P^\circ$ (Eq. (7) and 0) and $P^+$ (the Kernel Sum approach), we compare the spectral decay rate between
various frameworks on synthetic clustered data drawn from Gaussian distributions. The following steps
describe the generation of both views, denoted by $(X, Y)$ and referred to as View-I ($X$) and View-II ($Y$),
respectively:

1) 6 vectors $\mu_j \in \mathbb{R}^9$, $j = 1, \ldots, 6$, were drawn from a Gaussian distribution $N(0, 8 \cdot I_{9 \times 9})$. These
vectors are the centers of mass of the generated classes.

2) 100 data points were drawn for each cluster $j$, $1 \le j \le 6$, from a Gaussian distribution
$N(\mu_j, 2 \cdot I_{9 \times 9})$. Denote these 600 data points by $X$.

3) 100 data points were drawn for each cluster $j$, $1 \le j \le 6$, from a Gaussian distribution
$N(\mu_j, 2 \cdot I_{9 \times 9})$. Denote these 600 data points by $Y$.
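The three generation steps above can be sketched as follows. The function name, the seed and the returned label vector are assumptions added for illustration, and $8 \cdot I$ and $2 \cdot I$ are interpreted as covariance matrices, so the standard deviations are $\sqrt{8}$ and $\sqrt{2}$.

```python
import numpy as np

def make_two_view_gaussian_mixture(n_clusters=6, n_per_cluster=100, dim=9, seed=0):
    """Two views that share cluster centers but have independent within-cluster noise."""
    rng = np.random.default_rng(seed)
    # Step 1: cluster centers drawn from N(0, 8 I).
    centers = rng.normal(scale=np.sqrt(8.0), size=(n_clusters, dim))
    labels = np.repeat(np.arange(n_clusters), n_per_cluster)
    # Steps 2 and 3: each view draws its own points from N(mu_j, 2 I).
    X = centers[labels] + rng.normal(scale=np.sqrt(2.0), size=(labels.size, dim))
    Y = centers[labels] + rng.normal(scale=np.sqrt(2.0), size=(labels.size, dim))
    return X, Y, labels

X, Y, labels = make_two_view_gaussian_mixture()
print(X.shape, Y.shape)
```

The rows of the two views are aligned, so row $i$ of $X$ and row $i$ of $Y$ belong to the same cluster, which is the coupling that the multi-view kernel exploits.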

The first 3 dimensions of both views are presented in Fig. 4. We compute the probability matrix for
each view, $P^x$ and $P^y$, the Kernel Sum approach probability matrix $P^+$, the Kernel
Product approach $P^\circ$ (Eq. (7)) and the proposed approach $P$. The eigendecomposition is computed for
all matrices. The resulting eigenvalues' decay rates are compared with the product of the eigenvalues from both
views. To get a fair comparison between all the methods, we set the Gaussian scale parameters $\sigma_x$ and
$\sigma_y$ in each view and then use these scales in all the methods. The vectors' variance in the concatenation
approach is the sum of variances since we assume statistical independence. Therefore, the scale
parameter $\sigma_\circ^2 = \sigma_x^2 + \sigma_y^2$ is used.

Fig. 4: The first 3 dimensions of the Gaussian mixture. Both views share the centers of mass of the
Gaussian spread. Top: first view denoted as $X$. Bottom: second view denoted as $Y$. The variance of the
Gaussian in each dimension is 8.

The experiment is repeated, but this time $X$ contains 6 clusters whereas $Y$ contains only 3. For $Y$,
we use only the first 3 centers of mass and generate 200 points in each cluster. Figure 5 presents a
logarithmic scale of the spectral decay for eigenvalues extracted from all methods. It is evident that our
proposed kernel has the strongest spectral decay.




Fig. 5: Eigenvalues decay rate (horizontal axis: number of eigenvalue). Comparison between different mapping methods: Kernel product (element wise), Kernel sum, MultiView Diffusion, View 1, View 2 and Eigenvalues multiplication. Top: 6 clusters in each
view. Bottom: 6 clusters in X and 3 clusters in Y.

Fig. 6: The two Swiss Rolls. Top: generated by Eq. (39). Bottom: generated by Eq. (40).


2) Cross view diffusion distance: In this section, we examine the proposed Cross View Diffusion
Distance (Section IV-C). A Swiss roll is generated by using the function

$$\text{View I:} \quad X = \begin{bmatrix} x_i[1] \\ x_i[2] \\ x_i[3] \end{bmatrix} = \begin{bmatrix} 6\theta_i \cos(\theta_i) \\ h_i \\ 6\theta_i \sin(\theta_i) \end{bmatrix} + N_i^1, \tag{39}$$

where $\theta_i = (1.5\pi) s_i$, $i = 1, 2, 3, \ldots, 1000$, and $s_i$ are 1000 data points spread linearly within the interval
$[1, 3]$. The second view is generated by the application of an orthonormal transformation to the
Swiss roll and adding Gaussian noise. The function in Eq. (40) describes the representation of the second
view

$$\text{View II:} \quad Y = \begin{bmatrix} y_i[1] \\ y_i[2] \\ y_i[3] \end{bmatrix} = R \begin{bmatrix} 6\theta_i \cos(\theta_i) \\ h_i \\ 6\theta_i \sin(\theta_i) \end{bmatrix} + N_i^2, \tag{40}$$

where $R \in \mathbb{R}^{3 \times 3}$ is a random orthonormal transformation matrix. It is generated by drawing values from
i.i.d. Gaussian variables and applying the Gram-Schmidt process. $h_i$, $i = 1, \ldots, 1000$, are drawn from a
uniform distribution within the interval $[0, 100]$. Each component of $N_i^1, N_i^2 \in \mathbb{R}^{3 \times 1}$ is drawn from a
Gaussian distribution with zero mean and a variance of $\sigma_N^2$. An example for both Swiss rolls is presented
in Fig. 6. A standard DM is applied to each view and a 2-dimensional embedding of the Swiss roll is
extracted. The sum of distances between all the data points in the embedding spaces is denoted as a single
view diffusion distance (SVDD). The distance is computed using the following measure

$$\mathcal{D}_t^{(SV)}(X, Y) = \sum_{i=1}^{M} \| \Psi_t(x_i) - \Psi_t(y_i) \|^2, \tag{41}$$

where $\Psi_t(x_i)$ and $\Psi_t(y_i)$, $i = 1, \ldots, M$, are the single view diffusion mappings. Then, the proposed framework is
applied to extract the coupled embedding. A Cross View Diffusion Distance (CVDD) is computed using
Eq. (20). This experiment was executed 100 times for various values of the Gaussian noise variance $\sigma_N^2$.

In about 10% of the single view simulations the embeddings' axes are flipped. This generates a large
SVDD although the embeddings share similar structures. In order to remove this type of error we use
the median of 100 simulations. The median of the results is presented in Fig. 7.

Fig. 7: Comparison between two cross view diffusion based distances. Simulated on two Swiss rolls with
additive Gaussian noise. The results are the median of 100 simulations.
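The two Swiss roll views described above can be generated with a short sketch. Several choices here are assumptions made for illustration: the function name, the default noise variance, the addition of noise to both views, and the use of a QR factorization of a Gaussian matrix in place of an explicit Gram-Schmidt process (both yield an orthonormal matrix).

```python
import numpy as np

def make_swiss_roll_views(n=1000, noise_var=0.1, seed=0):
    """View I is a noisy Swiss roll, View II is a rotated noisy copy of the same roll."""
    rng = np.random.default_rng(seed)
    s = np.linspace(1.0, 3.0, n)                    # spread linearly in [1, 3]
    theta = 1.5 * np.pi * s
    h = rng.uniform(0.0, 100.0, size=n)             # height drawn from U[0, 100]
    roll = np.column_stack([6 * theta * np.cos(theta), h, 6 * theta * np.sin(theta)])
    R, _ = np.linalg.qr(rng.normal(size=(3, 3)))    # random orthonormal matrix
    std = np.sqrt(noise_var)
    X = roll + rng.normal(scale=std, size=roll.shape)
    Y = roll @ R.T + rng.normal(scale=std, size=roll.shape)
    return X, Y, R

X, Y, R = make_swiss_roll_views()
print(X.shape, Y.shape)
```

Sweeping `noise_var` over a range of values and repeating the generation with different seeds mirrors the structure of the experiment, with the median taken over the repetitions.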


B. Multi-view clustering 

The task of clustering has been at the core of machine learning for many years. The goal is to divide a
given data set into subsets based on the inherent structure of the data. We use the multi-view construction
to extract low dimensional mappings from multiple sets of high dimensional data points. In the following
experiments we demonstrate the advantage of the proposed approach on both artificial and real data sets.
For the real data sets, applying the multi-view approach requires an eigendecomposition of large matrices.
To reduce the runtime of the experiments we use an approximate matrix decomposition based on sparse random
projections [43].

1) Two circles clustering: Spectral properties of data sets are useful for clustering since they reveal
information about the unknown number of clusters. The characteristics of the eigenvalues of $P$ (Eq. (6))
can provide insight into the number of clusters within the data set. The study in [44] relates the number
of clusters to the multiplicity of the eigenvalue 1. A different approach in [45] provides an analysis of
the relation between the eigenvalue drop and the number of clusters. In this section, we evaluate how our
proposed method captures the clusters' structure when two views are available.

We generate two circles that represent the original clusters using the function

$$\begin{bmatrix} z_i[1] \\ z_i[2] \end{bmatrix} = \begin{bmatrix} r \cdot \cos(\theta_i) \\ r \cdot \sin(\theta_i) \end{bmatrix}, \tag{42}$$

where 1600 points $\theta_i$, $1 \le i \le 1600$, are spread linearly within the interval $[0, 4\pi]$. The clusters are created
by changing the radius as follows:
$r = 2$, $1 \le i \le 800$ (first cluster), and $r = 4$, $801 \le i \le 1600$ (second cluster). The views $X$ (Eq. (43))
and $Y$ (Eq. (44)) are generated by the application of the following non-linear functions that produce the
distorted views

$$x_i[1] = \begin{cases} z_i[1] + 1 + n_i[2], & z_i[2] \ge 0 \\ z_i[1] + n_i[3], & z_i[2] < 0 \end{cases}, \qquad x_i[2] = z_i[2] + n_i[1], \tag{43}$$

and

$$y_i[1] = z_i[1] + n_i[4], \qquad y_i[2] = \begin{cases} z_i[2] + 1 + n_i[5], & z_i[1] \ge 0 \\ z_i[2] + n_i[6], & z_i[1] < 0 \end{cases}, \tag{44}$$

where $n_i[l]$, $1 \le l \le 6$, are i.i.d. random variables drawn from a Gaussian distribution with $\mu = 0$ and
$\sigma_N^2 \in [0.03, 0.6]$. This data is referred to as the Coupled Circles dataset.

Fig. 8: Left: first view $X$. Right: second view $Y$. The ground truth clusters are represented by the marker's
shape and color.

In Fig. 8 the views $X$ and $Y$, which were generated by Eqs. (43) and (44), are presented. Color
and shape indicate the ground truth clusters. Initially, DM is applied to each view and clustering is
performed using K-means ($K = 2$) within the first diffusion coordinate. The kernel bandwidths $\sigma_x$ and $\sigma_y$
for all methods are set using the min-max method described earlier. We use $t = 1$ since it is
optimal for clustering tasks. For the kernel product method we use $\sigma_\circ = \sqrt{\sigma_x^2 + \sigma_y^2}$. We further extract a
1-dimensional representation using the proposed multi-view framework (Eq. (13)), the Kernel Sum DM,
Kernel Product DM (Eq. (7)), De Sa's approach (Eq. (37)) and Kernel CCA (Eq. (35)) described
in Section V. The regularization parameter is $\gamma = 0.01$ for KCCA and we use 100 components for the
Incomplete Cholesky Decomposition [40], [41]. Clustering is performed in the representation space by the
application of K-means where $K = 2$. To evaluate the performance of our proposed map, 100 simulations
with various values of the Gaussian noise variance (all with zero mean) were performed. The average
clustering success rate is presented in Fig. 9. It is evident that the multi-view based approach outperforms
the DM-based single view and the Kernel Product approaches.

The performance of kernel methods is highly dependent on setting an appropriate kernel bandwidth
$\sigma_x, \sigma_y$; in Algorithm 1 we have presented a method for setting such parameters. To evaluate the influence
of such parameters on the clustering quality we set $\sigma_N = 0.16$ and extract the multi-view, Kernel Sum
and Kernel Product diffusion mappings for various values of $\sigma_x, \sigma_y$. The average clustering performance
using K-means ($K = 2$) is presented in Fig. 10.

Fig. 9: Clustering results from averaging 200 trials vs. the variance of the Gaussian noise. The simulation was
performed on the Coupled Circles data (Eqs. (42), (43) and (44)).

Fig. 10: Clustering results from averaging 20 trials using various values of $\sigma_x, \sigma_y$ based on different
mappings. The standard deviation of the noise is $\sigma_N = 0.16$. Left: Kernel Sum DM, middle: Kernel
Product DM, right: Multi View DM.


2) Handwritten digits: For the following clustering experiment, we use the Multiple Features database
[46] from the UCI repository. The data set consists of 2000 handwritten digits from 0 to 9 that are
equally spread among the classes. The features extracted from these images are the profile correlations (FAC),
Karhunen-Loève coefficients (KAR), Zernike moments (ZER), morphological features (MOR), pixel averages in 2 x 3 windows (PIX)
and the Fourier coefficients (FOU), which serve as our feature spaces $X^1, X^2, X^3, X^4, X^5, X^6$, respectively. We apply
dimensionality reduction using a single view DM, Kernel Product DM, Kernel Sum DM and the proposed
Multi-view DM. We apply K-means to the reduced mapping using 6 to 20 coordinates. The clustering
performance is measured using the Normalized Mutual Information [47] (NMI). Figure 11 presents the
average clustering results using K-means (K = 10).
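
Several normalizations of the mutual information are in use, and the text does not state which one is applied. A minimal sketch, assuming the variant that divides the mutual information by the geometric mean of the two label entropies (function names are hypothetical), is:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(labels_a, labels_b):
    """NMI = I(A; B) / sqrt(H(A) * H(B)); the normalization is an assumption."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = 0.0
    for (a, b), c in joint.items():
        mutual_info += (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case (a single cluster): convention chosen for this sketch.
        return 1.0 if h_a == h_b else 0.0
    return mutual_info / math.sqrt(h_a * h_b)
```

The score does not depend on how the clusters are numbered, which is why it suits K-means output, where cluster indices are arbitrary.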

3) Isolet data set: The data set was constructed by recording 150 people pronouncing each of the 26 letters
twice. The available feature vector is a concatenation of the following features: spectral
coefficients, contour, sonorant, pre-sonorant and post-sonorant features. The authors do not provide the
separation of the features; therefore, the feature vector is used as a whole and its dimension is 617. We use a subset of the data with 1559
instances, thus the feature space is $X \in \mathbb{R}^{1559 \times 617}$. To apply the multi-view approach, we compute 3
different kernels and fuse them together. The first kernel $K^1$ is the standard Gaussian kernel defined
earlier. $K^2$ is a Laplacian kernel defined by


$$K^2_{i,j} = \exp\left\{ -\frac{|x_i - x_j|}{\sigma^2} \right\}. \qquad (45)$$


The third kernel $K^3$ is an exponent with a correlation distance as the affinity measure, given by

$$K^3_{i,j} = \exp\left\{ \frac{T_{i,j} - 1}{2\sigma^2} \right\}, \qquad (46)$$
Fig. 11: Average clustering accuracy running 100 simulations on the Handwritten data set. Accuracy is
measured using the Normalized Mutual Information (NMI).

Fig. 12: Clustering accuracy measured with Normalized Mutual Information (NMI) on the Isolet data set
by using 3 different kernel matrices. Clustering was performed in the r dimensional embedding space.


where $T_{i,j}$ is the correlation coefficient between the i-th and j-th feature vectors, computed by

$$T_{i,j} = \frac{\tilde{x}_i^T \tilde{x}_j}{\sqrt{\tilde{x}_i^T \tilde{x}_i}\,\sqrt{\tilde{x}_j^T \tilde{x}_j}}, \quad i,j = 1,\ldots,M. \qquad (47)$$


The average-subtracted features are $\tilde{x}_i = x_i - \eta_i \cdot \mathbf{1}$, where $\eta_i$ is the average of the features for instance i.
We fuse the kernels using the multi-view, Kernel Product and Kernel Sum approaches and then apply K-means
to the extracted space. The average NMI for 26 classes is presented in Fig. 12.
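
The three affinities can be sketched for a single pair of feature vectors as below. This is an illustration under stated assumptions rather than the authors' implementation: the Gaussian kernel is taken in its common form exp(-||x - y||^2 / (2 sigma^2)), the absolute value in the Laplacian kernel is read as an L1 norm, the Laplacian denominator is passed in as a generic `scale` parameter, and all function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian affinity; the 2*sigma^2 normalization is an assumption."""
    squared_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_dist / (2.0 * sigma ** 2))


def laplacian_kernel(x, y, scale):
    """Laplacian affinity using an L1 distance (assumed) and a generic scale."""
    l1_dist = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-l1_dist / scale)


def correlation_kernel(x, y, sigma):
    """exp((T - 1) / (2 sigma^2)), where T is the correlation coefficient.

    Each vector has its own mean subtracted first. A constant vector has zero
    norm after centring and is not handled by this sketch.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    xc = [a - mean_x for a in x]
    yc = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(xc, yc))
    denominator = math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc))
    t = numerator / denominator
    return math.exp((t - 1.0) / (2.0 * sigma ** 2))
```

Evaluating each function over all pairs of instances yields the three kernel matrices that are then fused by the multi-view, Kernel Product or Kernel Sum construction.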


C. Manifold learning 

The power of manifold learning appears when the data is sampled in a high dimensional space but
actually lies on a low dimensional surface. Extracting the underlying surface provides insight into the
physical process creating the data. Examples include vision [48], audio [49], medical applications [50] and more. In this
section, we demonstrate the proposed approach on an artificial manifold and on a toy example of a video
sequence.
Fig. 13: Top ("X - First View"): first Helix X (Eq. (48)). Bottom ("Y - Second View"): second Helix Y (Eq. (49)). Both manifolds have some
circular structure governed by the angle parameters $a_i$ and $b_i$, $i = 1, 2, 3, \ldots, 1000$, colored by the point
index i.


1) Artificial manifold learning: In the general DM approach, there is an assumption that the sampled
space describes a low dimensional manifold. However, this assumption can be incorrect since the sampled
space can contain redundancy in the manifold or, more generally, the sampled space can
describe two or more manifolds generated by a common physical process. In this section, we examine the
embedding computed using our method and compare it to the Kernel Product approach (section III-B).


Helix A 

Two coupled manifolds with a common underlying open circular structure are generated. The helix shaped
manifolds were generated by applying a 3-dimensional function to 1000 data points that are spread
linearly within $a_i \in [0, 2\pi]$, with $b_i = (a_i + 0.5\pi) \bmod 2\pi$, $i = 1, 2, 3, \ldots, 1000$. The functions
in Eqs. (48) and (49) are used to generate the datasets for View-I and View-II, denoted as X and Y,
respectively:


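$$\text{View I: } X = \begin{bmatrix} x_i[1] \\ x_i[2] \\ x_i[3] \end{bmatrix} = \begin{bmatrix} 4\cos(0.9a_i) + 0.3\cos(20a_i) \\ 4\sin(0.9a_i) + 0.3\sin(20a_i) \\ 0.1(6.3a_i^2 - a_i^3) \end{bmatrix}, \quad i = 1, 2, 3, \ldots, 1000, \qquad (48)$$

$$\text{View II: } Y = \begin{bmatrix} y_i[1] \\ y_i[2] \\ y_i[3] \end{bmatrix} = \begin{bmatrix} 4\cos(0.9b_i) + 0.3\cos(20b_i) \\ 4\sin(0.9b_i) + 0.3\sin(20b_i) \\ 0.1(6.3b_i^2 - b_i^3) \end{bmatrix}, \quad i = 1, 2, 3, \ldots, 1000. \qquad (49)$$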
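
The two Helix A views can be generated with a few lines of code. The sketch below is illustrative only and rests on two assumptions: the angles $a_i$ are evenly spaced over [0, 2π] with both endpoints included, and the third coordinate is read as 0.1(6.3 t^2 - t^3). The function names are hypothetical.

```python
import math


def helix_point(t):
    """Map an angle t to a 3-dimensional point on the Helix A curve."""
    return (
        4.0 * math.cos(0.9 * t) + 0.3 * math.cos(20.0 * t),
        4.0 * math.sin(0.9 * t) + 0.3 * math.sin(20.0 * t),
        0.1 * (6.3 * t ** 2 - t ** 3),  # assumed reading of the third coordinate
    )


def helix_a(n_points=1000):
    """Return the two coupled views X and Y as lists of 3-tuples."""
    two_pi = 2.0 * math.pi
    view_x, view_y = [], []
    for i in range(n_points):
        a = two_pi * i / (n_points - 1)      # assumption: evenly spaced angles
        b = (a + 0.5 * math.pi) % two_pi     # shifted angle for the second view
        view_x.append(helix_point(a))
        view_y.append(helix_point(b))
    return view_x, view_y
```

Both views share the same index i, so the i-th rows of X and Y are two observations of the same underlying sample, which is the correspondence the multi-view kernel relies on.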


The 3-dimensional Helix shaped manifolds X and Y are presented in Fig. 13.

The Kernel Product mapping (Eq. (7)) separates the manifold into a bow and a point, as shown in Fig.
15. This structure neither represents any of the original structures nor reveals the underlying parameters
$a_i, b_i$. On the other hand, our embedding (Eq. (13)) captures the two structures, one for each view. As
shown in Fig. 14, one structure represents the angle $a_i$, while the other represents the angle $b_i$. The
Euclidean distance in the new spaces preserves the mutual relations between data points based on the
geometrical relation in both views. Moreover, both manifolds are in the same coordinate system, which
is a strong advantage as it enables us to compare the manifolds in the lower dimensional space.

Helix B 

The previous experiment was repeated with the functions in Eqs. (50) and (51) to generate datasets for
View-I and View-II, denoted by X and Y, respectively.


$$\text{View I: } X = \begin{bmatrix} x_i[1] \\ x_i[2] \\ x_i[3] \end{bmatrix} = \begin{bmatrix} 4\cos(5a_i) \\ 4\sin(5a_i) \\ 4a_i \end{bmatrix}, \qquad (50)$$
Fig. 14: Top: Multi-View based embedding of the first view $\widehat{\Psi}(X)$. Bottom: Multi-View based embedding
of the second view $\widehat{\Psi}(Y)$. Both were computed by using Eq. (13).

Fig. 15: 2-dimensional DM-based mapping of the Helix computed using the concatenated vector from
both views, which corresponds to the kernel $P^{\circ}$ (Eq. (7)).

Fig. 16: Top ("X - First View"): first Helix X (Eq. (50)). Bottom ("Y - Second View"): second Helix Y (Eq. (51)). Both manifolds have some
circular structure governed by the angle parameters $a_i$ and $b_i$, $i = 1, 2, 3, \ldots, 1000$, as colored by the
point index i.

Fig. 17: The coupled mappings computed using our proposed parametrization in Eq. (13).

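$$\text{View II: } Y = \begin{bmatrix} y_i[1] \\ y_i[2] \\ y_i[3] \end{bmatrix} = \begin{bmatrix} 4\cos(5b_i) \\ 4\sin(5b_i) \\ 4b_i \end{bmatrix}. \qquad (51)$$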


Again, 1000 points were generated using $a_i \in [0, 2\pi]$, $b_i = (a_i + 0.5\pi) \bmod 2\pi$, $i = 1, 2, 3, \ldots, 1000$. The
generated manifolds are presented in Fig. 16.

As can be seen in Fig. 17, the proposed embeddings (Eq. (13)) have successfully captured the governing
parameters $a_i$ and $b_i$. The Kernel Product based embedding (Eq. (7)) is presented in Fig. 18. The Kernel
Product based embedding again separated the data points into two unconnected structures that do not
represent the parameters well.

2) MultiView video sequence: Various examples such as images, audio and MRI [28], [8] and [51] have
demonstrated the power of DM for extracting the underlying changing physical parameters
from real datasets. In this experiment, the multi-view approach is tested on real data. Two web cameras and a toy
train with a pre-designed path are used. The train’s path has an “eight” shaped structure. Extracting the
underlying manifold from the set of images enables us to organize the images according to the location
along the train’s path and thus reveals the true underlying parameters of the process.
Fig. 18: A 2-dimensional mapping, extracted based on $P^{\circ}$ (Eq. (7)).


The setting of the experiment is as follows: each camera records a set of images from a different
angle. A sample frame from each view is presented in Fig. 19. The video is sampled at 30 frames per
second with a resolution of 640x480 pixels per frame. M = 220 images were collected from each view
(camera). Then, the R, G, B values were averaged and the images were downsampled to a resolution of 160x120 pixels. The
matrices were reshaped into column vectors. The resulting sets of vectors are denoted by X and Y, where
$x_i, y_i \in \mathbb{R}^{19200}$, $1 \leq i \leq 220$. The sequential order of the images is not important for the algorithm. In
a normal setting, one view is sufficient to extract the parameters that govern the movement of the train
and thus extract the natural order of the images. However, we use two types of interference to create
a scenario in which each view by itself is insufficient for the extraction of the underlying parameters.
The first interference is a gap in the recording of each camera. We remove 20 consecutive frames from
each view at different time locations. This breaks the bijective correspondence of some of the images
in the sequence. However, even an approximate correspondence is sufficient for our proposed
manifold extraction. A standard 2-dimensional DM-based mapping of each view was extracted. The results
are bow shaped manifolds, as presented in Fig. 20. Applying DM separately to each view extracts the
correct order of the data points (images) along the path. However, the “missing” data points break the
circular structure of the expected manifold and result in a bow shaped embedding. We use the multi-view
based methodology to overcome this interference by applying the multi-view framework to extract
two coupled mappings (Eq. (13)). The results are presented in Fig. 21. The proposed approach overcomes
the interference by smoothing the gap inherent in each view through the use of connectivities from the
“undistracted” view. Finally, we concatenate the vectors from both views and compute the Kernel Product
embedding. The results are presented in Fig. 22. Again, the structure of the manifold is distorted and
incomplete due to the missing images.
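
The frame preprocessing described above (averaging the R, G, B values, reducing 640x480 to 160x120 and flattening) can be sketched as follows. The downsampling method is not specified in the text; block averaging over 4x4 windows is assumed here, and the function name `frame_to_vector` is hypothetical.

```python
def frame_to_vector(frame, factor=4):
    """Convert an RGB frame into a flattened, downsampled gray-level vector.

    frame: list of rows, each row a list of (r, g, b) tuples.
    factor: downsampling factor per axis (4 maps 640x480 to 160x120).
    Assumption: downsampling is done by averaging factor x factor blocks.
    """
    height, width = len(frame), len(frame[0])
    if height % factor or width % factor:
        raise ValueError("frame size must be divisible by the factor")
    # Average the three colour channels of every pixel.
    gray = [[sum(pixel) / 3.0 for pixel in row] for row in frame]
    vector = []
    for top in range(0, height, factor):
        for left in range(0, width, factor):
            block = [
                gray[r][c]
                for r in range(top, top + factor)
                for c in range(left, left + factor)
            ]
            vector.append(sum(block) / len(block))
    return vector
```

The sketch flattens in row-major order. The ordering does not affect pairwise distances between frames as long as the same ordering is used for every frame.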

This experiment was repeated while replacing 10 frames from each view with Gaussian noise whose
mean and variance are $\mu = 0$ and $\sigma^2 = 10$, respectively. A single view DM-based
mapping was computed. The Kernel Product-based DM and the multi-view based DM mappings were
computed as well. As presented in Fig. 23, the Gaussian noise distorted the manifolds extracted in each
view. The multi-view approach extracted two circular structures, presented in Fig. 24. Again, the data
points are ordered according to the position along the path. This time, the circular structure is unfolded
and the gaps are visible in both embeddings. Applying the Kernel Product approach (Eq. (7)) has yielded
a distorted manifold, as presented in Fig. 25.


D. Classification in the embedding space 

Besides improving classification results, dimensionality reduction can reduce the execution time. Studies
such as [52], [53] and [54] focus on the role of dimensionality reduction for classification. Applications
Fig. 19: Top: a sample image from the first camera (X). Bottom: a sample image from the second camera
(Y).

Fig. 20: Top: DM-based single view mapping $\Psi(X)$. Bottom: DM-based single view mapping $\Psi(Y)$.
The removed images caused a bow shaped structure.

Fig. 21: Top: Mapping $\widehat{\Psi}(X)$. Bottom: Mapping $\widehat{\Psi}(Y)$, as extracted by the multi-view based framework.
Two small gaps, which correspond to the removed images, are visible.

Fig. 22: A standard diffusion mapping (Kernel Product-based) that was computed by using the concatenated
vector from both views, which corresponds to kernel $K^{\circ}$.

Fig. 23: Top: DM-based single view mapping $\Psi(X)$. Bottom: DM-based single view mapping $\Psi(Y)$.
The Gaussian noise deformed the circular structure.

Fig. 24: Top: Mapping $\widehat{\Psi}(X)$. Bottom: Mapping $\widehat{\Psi}(Y)$, as extracted by the multi-view framework. Two
gaps that correspond to the Gaussian noise are visible.

Fig. 25: Kernel Product Diffusion Map $\Psi^{\circ}$. Computation of a standard diffusion mapping (Kernel Product) by using the concatenated vector
from both views (corresponding to kernel $P^{\circ}$, Eq. (7)).

TABLE I: Classification accuracy using 1-fold cross validation; r is the number of coordinates used in
the embedding space.

Method                    Accuracy [%] (r = 3)    Accuracy [%] (r = 4)
Single View DM ($X^1$)    89.9                    93
Single View DM ($X^2$)    88.6                    92.4
Single View DM ($X^3$)    89.3                    91.1
Single View DM ($X^4$)    89.3                    89.2
Single View DM ($X^5$)    89.3                    90.5
Single View DM ($X^6$)    88.6                    91.1
Kernel Sum DM             93.7                    94.9
Kernel Product DM         94.3                    93
Multi-view DM             97.5                    98.1


have been studied in diverse fields m ® IE1. 

1) Classification of seismic events: Various studies have used machine learning for seismic event
classification. Some of the methods used are artificial neural networks [57], [58], [59], self-organizing
maps [60], [61], hidden Markov models [62], [63] and support vector machines [64]. We use a data
set collected from a seismic catalog by the Geophysical Institute of Israel. The data set includes 46
earthquakes, 62 explosions and 62 noise waveforms. Each waveform was sampled at a frequency of 40
Hz. The length of each waveform is one minute, thus each consists of 2400 samples. Data was collected
from two stations. Each station uses a three-component seismometer, yielding a total of 6 views.

The features extracted from the waveform are termed Sonograms [65]. First, a spectrogram is computed
by using the short time Fourier transform (STFT). The frequency scale is rearranged to be equally
tempered on a logarithmic scale, such that the final spectrogram contains 11 frequency bands. The bins
are normalized such that the sum of energy in every frequency band is equal to 1. The resulting set of
sonograms is denoted by $X^1, X^2, X^3, X^4, X^5, X^6$. These are the input views for our framework.
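
The last two steps of the sonogram computation, regrouping the STFT bins into logarithmically spaced bands and normalizing each band, can be sketched as below. This is an illustrative reading of the description rather than the procedure of [65]: the STFT itself is taken as given, the band limits `f_min` and `f_max` are hypothetical parameters, and the function names are assumptions.

```python
def log_band_edges(f_min, f_max, n_bands=11):
    """Band edges in Hz, equally spaced on a logarithmic frequency scale."""
    ratio = f_max / f_min
    return [f_min * ratio ** (k / n_bands) for k in range(n_bands + 1)]


def sonogram(power, freqs, edges):
    """Group STFT energies into bands and normalize each band to unit sum.

    power[f][t]: energy at frequency bin f and time frame t (STFT assumed given).
    freqs[f]: centre frequency of bin f in Hz.
    edges: band edges; a bin belongs to a band if lo <= freq < hi, so a bin
    lying exactly on the top edge is left out.
    """
    n_frames = len(power[0])
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = [0.0] * n_frames
        for freq, row in zip(freqs, power):
            if lo <= freq < hi:
                band = [acc + energy for acc, energy in zip(band, row)]
        total = sum(band)
        # Bands that contain no bins (or no energy) are left as zeros.
        bands.append([e / total for e in band] if total > 0 else band)
    return bands
```

With a 40 Hz sampling rate the usable frequency range ends at 20 Hz, so a call such as `log_band_edges(0.5, 20.0)` would be one plausible choice; the lower limit of 0.5 Hz is an arbitrary example value.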

We apply the proposed framework, as well as the single view DM, Kernel Product DM and Kernel Sum
DM. Classification is performed by using K-NN (K = 1), based on 3 or 4 coordinates from the reduced
mapping. The results are presented in Table I.
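
Nearest-neighbour classification in the embedding space can be sketched as follows. The validation protocol is assumed here to be leave-one-out, which is only one possible reading of the cross-validation scheme named in Table I, and the function name is hypothetical.

```python
import math


def one_nn_leave_one_out(points, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier.

    points: embedding coordinates (e.g. r = 3 or r = 4 values per sample).
    labels: class label of every sample.
    Assumption: each sample is classified by its nearest other sample.
    """
    correct = 0
    for i, (p, label) in enumerate(zip(points, labels)):
        best_dist, best_label = math.inf, None
        for j, (q, other_label) in enumerate(zip(points, labels)):
            if i == j:
                continue  # the held-out sample cannot vote for itself
            d = math.dist(p, q)
            if d < best_dist:
                best_dist, best_label = d, other_label
        correct += best_label == label
    return correct / len(points)
```

Because the classifier relies only on Euclidean distances between embedded points, it directly tests how well an embedding separates the classes.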

To evaluate the multi-view classification performance on subsets of the L = 6 views, we use subsets
of only two views. We evaluate the classification results based on all the pairs of views $X^l, X^m$, $l, m =
1, \ldots, 6$, $l \neq m$. The accuracy of classification based on the multi-view representation $\widehat{\Psi}(X^l)$, $l = 1, \ldots, 6$,
given $X^m$, $m = 1, \ldots, 6$, $m \neq l$, is presented in Fig. 26.
Fig. 26: Classification accuracy using K-NN (K = 1) for all pairs of views $X^l, X^m$, $l \neq m$. The y-axis is
the number of the first view used, while the x-axis is the number of the second view. Classification is
performed in the multi-view low dimensional embedding (r = 4). The diagonal terms are presented as
zero since we did not simulate for l = m.


VII. Discussion 

In this paper, we presented a multi-view based framework for dimensionality reduction. The
method enables us to extract simultaneous embeddings from coupled views. We enforce a cross
domain probabilistic model at a single time step. The transition probabilities depend on the connectivities
in both views. We derived various theoretical aspects of the proposed method and demonstrated its
applicability to both artificial and real data. The experimental results demonstrate the strength of the
proposed framework in cases where data is missing in each view or each of the manifolds is deformed by
an unknown function. The framework is applicable to various real life machine learning tasks that involve
multiple views or multiple modalities.